2024-04-05 03:59:33

by Gavin Shan

Subject: [PATCH v3 0/3] arm64: tlb: Fix TLBI RANGE operand

A kernel crash on the destination VM after live migration was reported
by Yihuang Yu. The issue is only reproducible on NVIDIA's Grace Hopper,
where the TLBI RANGE feature is available. The kernel crash is caused
by an incomplete TLB flush and missed dirty pages. For the root cause
and analysis, please refer to the commit log of PATCH[v3 1/3].

Thanks to Marc Zyngier who proposed all the code changes.

PATCH[1] fixes the kernel crash by extending __TLBI_RANGE_NUM() so that
         a TLBI RANGE operation on an area of MAX_TLBI_RANGE_PAGES pages
         is supported.
PATCH[2] improves __TLBI_VADDR_RANGE() with masks and FIELD_PREP().
PATCH[3] allows a TLBI RANGE operation on an area of MAX_TLBI_RANGE_PAGES
         pages in __flush_tlb_range_nosync().

v2: https://lists.infradead.org/pipermail/linux-arm-kernel/2024-April/917432.html
v1: https://lists.infradead.org/pipermail/linux-arm-kernel/2024-April/916972.html

Changelog
=========
v3:
Improve __TLBI_RANGE_NUM() and its comments. Add patches
to improve __TLBI_VADDR_RANGE() and __flush_tlb_range_nosync() (Marc)
v2:
Improve __TLBI_RANGE_NUM() (Marc)

Gavin Shan (3):
arm64: tlb: Fix TLBI RANGE operand
arm64: tlb: Improve __TLBI_VADDR_RANGE()
arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGES

arch/arm64/include/asm/tlbflush.h | 53 ++++++++++++++++++-------------
1 file changed, 31 insertions(+), 22 deletions(-)

--
2.44.0



2024-04-05 03:59:44

by Gavin Shan

Subject: [PATCH v3 1/3] arm64: tlb: Fix TLBI RANGE operand

KVM/arm64 relies on the TLBI RANGE feature to flush TLBs when the dirty
pages are collected by the VMM and the page table entries become write
protected during live migration. Unfortunately, the operand passed to
the TLBI RANGE instruction isn't correctly generated since commit
117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()"). This
leads to a crash on the destination VM after live migration because
the TLBs aren't flushed completely and some of the dirty pages are missed.

For example, I have a VM with 8GB of memory assigned, starting from
0x40000000 (1GB). Note that the host has 4KB as the base page size.
In the middle of migration, kvm_tlb_flush_vmid_range() is executed
to flush TLBs. It passes MAX_TLBI_RANGE_PAGES as the argument to
__kvm_tlb_flush_vmid_range() and __flush_s2_tlb_range_op(). The
combination of SCALE#3 and NUM#31, corresponding to MAX_TLBI_RANGE_PAGES,
isn't supported by __TLBI_RANGE_NUM(). In this specific case, -1 is
returned from __TLBI_RANGE_NUM() for SCALE#3/2/1/0 and rejected by the
loop in __flush_tlb_range_op() until the variable @scale underflows to
-9, and 0xffff708000040000 is set as the operand. The operand is wrong
since it's generated by __TLBI_VADDR_RANGE() from the invalid @scale
and @num.

Fix it by extending __TLBI_RANGE_NUM() to support the combination of
SCALE#3 and NUM#31. With the changes, the macro can return values in
[-1, 31] instead of [-1, 30], meaning the TLBs for 0x200000 pages in
the above example can be flushed in one shot with SCALE#3 and NUM#31.
The macro TLBI_RANGE_MASK is dropped since it is no longer used. The
comments are also adjusted accordingly.
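
As a quick sanity check, the behaviour can be reproduced with a small
userspace sketch (GCC-only because of the statement expression; the
macros mirror the kernel ones and min() is a simplified stand-in):

#include <stdio.h>

#define __TLBI_RANGE_PAGES(num, scale)  ((unsigned long)((num) + 1) << (5 * (scale) + 1))
#define MAX_TLBI_RANGE_PAGES            __TLBI_RANGE_PAGES(31, 3)
#define min(x, y)                       ((x) <= (y) ? (x) : (y))

/* old macro: 'num' is masked to 5 bits, so the 32 needed here wraps to 0 and yields -1 */
#define OLD_TLBI_RANGE_NUM(pages, scale) \
        ((((int)(pages) >> (5 * (scale) + 1)) & 0x1f) - 1)

/* new macro: clamp 'pages' to what the scale can encode instead of masking */
#define NEW_TLBI_RANGE_NUM(pages, scale) \
        ({ \
                int __pages = min((pages), __TLBI_RANGE_PAGES(31, (scale))); \
                (__pages >> (5 * (scale) + 1)) - 1; \
        })

int main(void)
{
        unsigned long pages = MAX_TLBI_RANGE_PAGES; /* 0x200000 pages == 8GB of 4KB pages */
        int scale;

        /* old: -1 at every scale, so __flush_tlb_range_op()'s loop underflows @scale */
        for (scale = 3; scale >= 0; scale--)
                printf("old: scale=%d num=%d\n", scale, OLD_TLBI_RANGE_NUM(pages, scale));

        /* new: 31 at SCALE#3, so the whole range is covered by a single operation */
        printf("new: scale=3 num=%d\n", NEW_TLBI_RANGE_NUM(pages, 3));
        return 0;
}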

Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")
Cc: [email protected] # v6.6+
Reported-by: Yihuang Yu <[email protected]>
Suggested-by: Marc Zyngier <[email protected]>
Signed-off-by: Gavin Shan <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 3b0e8248e1a4..a75de2665d84 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -161,12 +161,18 @@ static inline unsigned long get_trans_granule(void)
#define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)

/*
- * Generate 'num' values from -1 to 30 with -1 rejected by the
- * __flush_tlb_range() loop below.
+ * Generate 'num' values from -1 to 31 with -1 rejected by the
+ * __flush_tlb_range() loop below. Its return value is only
+ * significant for a maximum of MAX_TLBI_RANGE_PAGES pages. If
+ * 'pages' is more than that, you must iterate over the overall
+ * range.
*/
-#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
-#define __TLBI_RANGE_NUM(pages, scale) \
- ((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
+#define __TLBI_RANGE_NUM(pages, scale) \
+ ({ \
+ int __pages = min((pages), \
+ __TLBI_RANGE_PAGES(31, (scale))); \
+ (__pages >> (5 * (scale) + 1)) - 1; \
+ })

/*
* TLB Invalidation
@@ -379,10 +385,6 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
* 3. If there is 1 page remaining, flush it through non-range operations. Range
* operations can only span an even number of pages. We save this for last to
* ensure 64KB start alignment is maintained for the LPA2 case.
- *
- * Note that certain ranges can be represented by either num = 31 and
- * scale or num = 0 and scale + 1. The loop below favours the latter
- * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
*/
#define __flush_tlb_range_op(op, start, pages, stride, \
asid, tlb_level, tlbi_user, lpa2) \
--
2.44.0


2024-04-05 04:07:59

by Gavin Shan

Subject: [PATCH v3 2/3] arm64: tlb: Improve __TLBI_VADDR_RANGE()

The macro returns the operand of the TLBI RANGE instruction. A mask
needs to be applied to each individual field when producing the operand,
so that adjacent fields can't interfere with each other when invalid
arguments are provided. The code is also tidier with explicit masks
and FIELD_PREP().
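
For illustration, a minimal userspace sketch (with simplified stand-ins
for GENMASK_ULL() and FIELD_PREP(); the kernel versions additionally do
type and constness checks) shows how the mask confines an out-of-range
value to its own field:

#include <stdio.h>

/* simplified stand-ins for the kernel macros */
#define GENMASK_ULL(h, l)       ((~0ULL << (l)) & (~0ULL >> (63 - (h))))
#define FIELD_PREP(mask, val)   (((unsigned long long)(val) << __builtin_ctzll(mask)) & (mask))

#define TLBIR_SCALE_MASK        GENMASK_ULL(45, 44)
#define TLBIR_NUM_MASK          GENMASK_ULL(43, 39)

int main(void)
{
        /* a bogus num of 63 shifted in directly spills into the SCALE field */
        unsigned long long plain_shift = 63ULL << 39;
        /* with FIELD_PREP() the value is masked and stays within bits 43:39 */
        unsigned long long masked = FIELD_PREP(TLBIR_NUM_MASK, 63);

        printf("plain shift: 0x%llx (bit 44 set, SCALE corrupted)\n",
               plain_shift & (TLBIR_SCALE_MASK | TLBIR_NUM_MASK));
        printf("FIELD_PREP:  0x%llx (confined to NUM)\n", masked);
        return 0;
}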

Suggested-by: Marc Zyngier <[email protected]>
Signed-off-by: Gavin Shan <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index a75de2665d84..243d71f7bc1f 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -142,17 +142,24 @@ static inline unsigned long get_trans_granule(void)
* EL1, Inner Shareable".
*
*/
-#define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \
- ({ \
- unsigned long __ta = (baddr); \
- unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
- __ta &= GENMASK_ULL(36, 0); \
- __ta |= __ttl << 37; \
- __ta |= (unsigned long)(num) << 39; \
- __ta |= (unsigned long)(scale) << 44; \
- __ta |= get_trans_granule() << 46; \
- __ta |= (unsigned long)(asid) << 48; \
- __ta; \
+#define TLBIR_ASID_MASK GENMASK_ULL(63, 48)
+#define TLBIR_TG_MASK GENMASK_ULL(47, 46)
+#define TLBIR_SCALE_MASK GENMASK_ULL(45, 44)
+#define TLBIR_NUM_MASK GENMASK_ULL(43, 39)
+#define TLBIR_TTL_MASK GENMASK_ULL(38, 37)
+#define TLBIR_BADDR_MASK GENMASK_ULL(36, 0)
+
+#define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \
+ ({ \
+ unsigned long __ta = 0; \
+ unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
+ __ta |= FIELD_PREP(TLBIR_BADDR_MASK, baddr); \
+ __ta |= FIELD_PREP(TLBIR_TTL_MASK, __ttl); \
+ __ta |= FIELD_PREP(TLBIR_NUM_MASK, num); \
+ __ta |= FIELD_PREP(TLBIR_SCALE_MASK, scale); \
+ __ta |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule()); \
+ __ta |= FIELD_PREP(TLBIR_ASID_MASK, asid); \
+ __ta; \
})

/* These macros are used by the TLBI RANGE feature. */
--
2.44.0


2024-04-05 04:08:06

by Gavin Shan

Subject: [PATCH v3 3/3] arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGES

MAX_TLBI_RANGE_PAGES pages are covered by SCALE#3 and NUM#31, which is
now supported. Allow the TLBI RANGE operation in __flush_tlb_range_nosync()
when the number of pages is exactly MAX_TLBI_RANGE_PAGES.
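
For reference, MAX_TLBI_RANGE_PAGES works out as:

  MAX_TLBI_RANGE_PAGES = __TLBI_RANGE_PAGES(31, 3)
                       = (31 + 1) << (5 * 3 + 1)
                       = 32 << 16
                       = 0x200000 pages  (8GB with 4KB pages)

which is exactly what one SCALE#3/NUM#31 operation covers, so only ranges
strictly larger than that still need flush_tlb_mm().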

Suggested-by: Marc Zyngier <[email protected]>
Signed-off-by: Gavin Shan <[email protected]>
---
arch/arm64/include/asm/tlbflush.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 243d71f7bc1f..95fbc8c05607 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -446,11 +446,11 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
* When not uses TLB range ops, we can handle up to
* (MAX_DVM_OPS - 1) pages;
* When uses TLB range ops, we can handle up to
- * (MAX_TLBI_RANGE_PAGES - 1) pages.
+ * MAX_TLBI_RANGE_PAGES pages.
*/
if ((!system_supports_tlb_range() &&
(end - start) >= (MAX_DVM_OPS * stride)) ||
- pages >= MAX_TLBI_RANGE_PAGES) {
+ pages > MAX_TLBI_RANGE_PAGES) {
flush_tlb_mm(vma->vm_mm);
return;
}
--
2.44.0


2024-04-05 17:12:29

by Catalin Marinas

Subject: Re: [PATCH v3 2/3] arm64: tlb: Improve __TLBI_VADDR_RANGE()

On Fri, Apr 05, 2024 at 01:58:51PM +1000, Gavin Shan wrote:
> The macro returns the operand of TLBI RANGE instruction. A mask needs
> to be applied to each individual field upon producing the operand, to
> avoid the adjacent fields can interfere with each other when invalid
> arguments have been provided. The code looks more tidy at least with
> a mask and FIELD_PREP().
>
> Suggested-by: Marc Zyngier <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>

Reviewed-by: Catalin Marinas <[email protected]>

2024-04-05 17:13:30

by Catalin Marinas

Subject: Re: [PATCH v3 3/3] arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGES

On Fri, Apr 05, 2024 at 01:58:52PM +1000, Gavin Shan wrote:
> MAX_TLBI_RANGE_PAGES pages is covered by SCALE#3 and NUM#31 and it's
> supported now. Allow TLBI RANGE operation when the number of pages is
> equal to MAX_TLBI_RANGE_PAGES in __flush_tlb_range_nosync().
>
> Suggested-by: Marc Zyngier <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>

Verified this case as well, so:

Reviewed-by: Catalin Marinas <[email protected]>

2024-04-05 17:22:20

by Catalin Marinas

Subject: Re: [PATCH v3 1/3] arm64: tlb: Fix TLBI RANGE operand

On Fri, Apr 05, 2024 at 01:58:50PM +1000, Gavin Shan wrote:
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 3b0e8248e1a4..a75de2665d84 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -161,12 +161,18 @@ static inline unsigned long get_trans_granule(void)
> #define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
>
> /*
> - * Generate 'num' values from -1 to 30 with -1 rejected by the
> - * __flush_tlb_range() loop below.
> + * Generate 'num' values from -1 to 31 with -1 rejected by the
> + * __flush_tlb_range() loop below. Its return value is only
> + * significant for a maximum of MAX_TLBI_RANGE_PAGES pages. If
> + * 'pages' is more than that, you must iterate over the overall
> + * range.
> */
> -#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
> -#define __TLBI_RANGE_NUM(pages, scale) \
> - ((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
> +#define __TLBI_RANGE_NUM(pages, scale) \
> + ({ \
> + int __pages = min((pages), \
> + __TLBI_RANGE_PAGES(31, (scale))); \
> + (__pages >> (5 * (scale) + 1)) - 1; \
> + })

Reviewed-by: Catalin Marinas <[email protected]>

This looks correct to me as well. I spent a bit of time updating an old
CBMC model I had around. With the original __TLBI_RANGE_NUM, it indeed
shows 'scale' becoming negative on the kvm_tlb_flush_vmid_range() path.
The patch above fixes it, and it also allows the non-KVM path to use the
range TLBI for MAX_TLBI_RANGE_PAGES (as per patch 3).

FWIW, here's the model:

-----------------------8<--------------------------------------
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Check with:
 *   cbmc --unwind 6 tlbinval.c
 */

#define PAGE_SHIFT              (12)
#define PAGE_SIZE               (1 << PAGE_SHIFT)
#define VA_RANGE                (1UL << 48)
#define SZ_64K                  0x00010000

#define __round_mask(x, y)      ((__typeof__(x))((y)-1))
#define round_up(x, y)          ((((x)-1) | __round_mask(x, y))+1)
#define round_down(x, y)        ((x) & ~__round_mask(x, y))

#define min(x, y)               (x <= y ? x : y)

#define __ALIGN_KERNEL(x, a)            __ALIGN_KERNEL_MASK(x, (__typeof__(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))
#define ALIGN(x, a)                     __ALIGN_KERNEL((x), (a))

/* only masking out irrelevant bits */
#define __TLBI_RANGE_VADDR(addr, shift) ((addr) & ~((1UL << shift) - 1))
#define __TLBI_VADDR(addr)              __TLBI_RANGE_VADDR(addr, PAGE_SHIFT)

#define __TLBI_RANGE_PAGES(num, scale)  ((unsigned long)((num) + 1) << (5 * (scale) + 1))
#define MAX_TLBI_RANGE_PAGES            __TLBI_RANGE_PAGES(31, 3)

#if 0
/* original code */
#define TLBI_RANGE_MASK         0x1fUL
#define __TLBI_RANGE_NUM(pages, scale)  \
        ((((int)(pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
#else
#define __TLBI_RANGE_NUM(pages, scale)                          \
        ({                                                      \
                int __pages = min((pages),                      \
                                  __TLBI_RANGE_PAGES(31, (scale))); \
                (__pages >> (5 * (scale) + 1)) - 1;             \
        })
#endif

const static _Bool lpa2 = 1;
const static _Bool kvm = 1;

static unsigned long inval_start;
static unsigned long inval_end;

static void tlbi(unsigned long start, unsigned long size)
{
        unsigned long end = start + size;

        if (inval_end == 0) {
                inval_start = start;
                inval_end = end;
                return;
        }

        /* optimal invalidation */
        __CPROVER_assert(start >= inval_end || end <= inval_start, "No overlapping TLBI range");

        if (start < inval_start) {
                __CPROVER_assert(end >= inval_start, "No TLBI range gaps");
                inval_start = start;
        }
        if (end > inval_end) {
                __CPROVER_assert(start <= inval_end, "No TLBI range gaps");
                inval_end = end;
        }
}

static void tlbi_range(unsigned long start, int num, int scale)
{
        unsigned long size = __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;

        tlbi(start, size);
}

static void __flush_tlb_range_op(unsigned long start, unsigned long pages,
                                 unsigned long stride)
{
        int num = 0;
        int scale = 3;
        int shift = lpa2 ? 16 : PAGE_SHIFT;
        unsigned long addr;

        while (pages > 0) {
                if (pages == 1 ||
                    (lpa2 && start != ALIGN(start, SZ_64K))) {
                        addr = __TLBI_VADDR(start);
                        tlbi(addr, stride);
                        start += stride;
                        pages -= stride >> PAGE_SHIFT;
                        continue;
                }

                __CPROVER_assert(scale >= 0 && scale <= 3, "Scale in range");
                num = __TLBI_RANGE_NUM(pages, scale);
                __CPROVER_assert(num <= 31, "Num in range");
                if (num >= 0) {
                        addr = __TLBI_RANGE_VADDR(start, shift);
                        tlbi_range(addr, num, scale);
                        start += __TLBI_RANGE_PAGES(num, scale) << PAGE_SHIFT;
                        pages -= __TLBI_RANGE_PAGES(num, scale);
                }
                scale--;
        }
}

static void __flush_tlb_range(unsigned long start, unsigned long pages,
                              unsigned long stride)
{
        if (pages > MAX_TLBI_RANGE_PAGES) {
                tlbi(0, VA_RANGE);
                return;
        }

        __flush_tlb_range_op(start, pages, stride);
}

void __kvm_tlb_flush_vmid_range(unsigned long start, unsigned long pages)
{
        unsigned long stride;

        stride = PAGE_SIZE;
        start = round_down(start, stride);

        __flush_tlb_range_op(start, pages, stride);
}

static void kvm_tlb_flush_vmid_range(unsigned long addr, unsigned long size)
{
        unsigned long pages, inval_pages;

        pages = size >> PAGE_SHIFT;
        while (pages > 0) {
                inval_pages = min(pages, MAX_TLBI_RANGE_PAGES);
                __kvm_tlb_flush_vmid_range(addr, inval_pages);

                addr += inval_pages << PAGE_SHIFT;
                pages -= inval_pages;
        }
}

static unsigned long nondet_ulong(void);

int main(void)
{
        unsigned long stride = nondet_ulong();
        unsigned long start = round_down(nondet_ulong(), stride);
        unsigned long end = round_up(nondet_ulong(), stride);
        unsigned long pages = (end - start) >> PAGE_SHIFT;

        __CPROVER_assume(stride == PAGE_SIZE ||
                         stride == PAGE_SIZE << (PAGE_SHIFT - 3) ||
                         stride == PAGE_SIZE << (2 * (PAGE_SHIFT - 3)));
        __CPROVER_assume(start < end);
        __CPROVER_assume(end <= VA_RANGE);

        if (kvm)
                kvm_tlb_flush_vmid_range(start, pages << PAGE_SHIFT);
        else
                __flush_tlb_range(start, pages, stride);

        __CPROVER_assert((inval_start == 0 && inval_end == VA_RANGE) ||
                         (inval_start == start && inval_end == end),
                         "Correct invalidation");

        return 0;
}

2024-04-08 08:39:33

by Ryan Roberts

Subject: Re: [PATCH v3 1/3] arm64: tlb: Fix TLBI RANGE operand

On 05/04/2024 04:58, Gavin Shan wrote:
> KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
> pages are collected by VMM and the page table entries become write
> protected during live migration. Unfortunately, the operand passed
> to the TLBI RANGE instruction isn't correctly sorted out due to the
> commit 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()").
> It leads to crash on the destination VM after live migration because
> TLBs aren't flushed completely and some of the dirty pages are missed.
>
> For example, I have a VM where 8GB memory is assigned, starting from
> 0x40000000 (1GB). Note that the host has 4KB as the base page size.
> In the middile of migration, kvm_tlb_flush_vmid_range() is executed
> to flush TLBs. It passes MAX_TLBI_RANGE_PAGES as the argument to
> __kvm_tlb_flush_vmid_range() and __flush_s2_tlb_range_op(). SCALE#3
> and NUM#31, corresponding to MAX_TLBI_RANGE_PAGES, isn't supported
> by __TLBI_RANGE_NUM(). In this specific case, -1 has been returned
> from __TLBI_RANGE_NUM() for SCALE#3/2/1/0 and rejected by the loop
> in the __flush_tlb_range_op() until the variable @scale underflows
> and becomes -9, 0xffff708000040000 is set as the operand. The operand
> is wrong since it's sorted out by __TLBI_VADDR_RANGE() according to
> invalid @scale and @num.
>
> Fix it by extending __TLBI_RANGE_NUM() to support the combination of
> SCALE#3 and NUM#31. With the changes, [-1 31] instead of [-1 30] can
> be returned from the macro, meaning the TLBs for 0x200000 pages in the
> above example can be flushed in one shoot with SCALE#3 and NUM#31. The
> macro TLBI_RANGE_MASK is dropped since no one uses it any more. The
> comments are also adjusted accordingly.

Perhaps I'm being overly pedantic, but I don't think the bug is
__TLBI_RANGE_NUM() not being able to return 31; it is clearly documented
that it can only return values in the range [-1, 30] and that a maximum
of (MAX_TLBI_RANGE_PAGES - 1) pages is supported.

The bug is in the kvm caller, which tries to call __flush_tlb_range_op() with
MAX_TLBI_RANGE_PAGES; clearly out-of-bounds.

So personally, I would prefer to fix the bug first. Then separately enhance the
infrastructure to support NUM=31.

>
> Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")

I would argue that the bug was actually introduced by commit 360839027a6e
("arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range"), which
separated the tlbi loop from the range size validation in __flush_tlb_range().
Before this, all calls would have to go through __flush_tlb_range() and
therefore anything bigger than (MAX_TLBI_RANGE_PAGES - 1) pages would cause the
whole mm to be flushed. Although I get that bisect will lead to this one, so
that's probably the right one to highlight.

I get why it was split, but perhaps it should have been split at a higher level;
if tlbi range is not supported, then KVM will flush the whole vmid. Would it be
better for KVM to follow the same pattern as __flush_tlb_range_nosync() and
issue per-block tlbis up to a max of MAX_DVM_OPS before falling back to the
whole vmid? And if tlbi range is supported, KVM uses it regardless of the size
of the range, whereas __flush_tlb_range_nosync() falls back to flush_tlb_mm()
at a certain size. It's not clear why this divergence is useful.



> Cc: [email protected] # v6.6+
> Reported-by: Yihuang Yu <[email protected]>
> Suggested-by: Marc Zyngier <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>

Anyway, the implementation looks correct, so:

Reviewed-by: Ryan Roberts <[email protected]>


> ---
> arch/arm64/include/asm/tlbflush.h | 20 +++++++++++---------
> 1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 3b0e8248e1a4..a75de2665d84 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -161,12 +161,18 @@ static inline unsigned long get_trans_granule(void)
> #define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
>
> /*
> - * Generate 'num' values from -1 to 30 with -1 rejected by the
> - * __flush_tlb_range() loop below.
> + * Generate 'num' values from -1 to 31 with -1 rejected by the
> + * __flush_tlb_range() loop below. Its return value is only
> + * significant for a maximum of MAX_TLBI_RANGE_PAGES pages. If
> + * 'pages' is more than that, you must iterate over the overall
> + * range.
> */
> -#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
> -#define __TLBI_RANGE_NUM(pages, scale) \
> - ((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
> +#define __TLBI_RANGE_NUM(pages, scale) \
> + ({ \
> + int __pages = min((pages), \
> + __TLBI_RANGE_PAGES(31, (scale))); \
> + (__pages >> (5 * (scale) + 1)) - 1; \
> + })
>
> /*
> * TLB Invalidation
> @@ -379,10 +385,6 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> * 3. If there is 1 page remaining, flush it through non-range operations. Range
> * operations can only span an even number of pages. We save this for last to
> * ensure 64KB start alignment is maintained for the LPA2 case.
> - *
> - * Note that certain ranges can be represented by either num = 31 and
> - * scale or num = 0 and scale + 1. The loop below favours the latter
> - * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
> */
> #define __flush_tlb_range_op(op, start, pages, stride, \
> asid, tlb_level, tlbi_user, lpa2) \


2024-04-08 08:40:18

by Ryan Roberts

Subject: Re: [PATCH v3 2/3] arm64: tlb: Improve __TLBI_VADDR_RANGE()

On 05/04/2024 04:58, Gavin Shan wrote:
> The macro returns the operand of TLBI RANGE instruction. A mask needs
> to be applied to each individual field upon producing the operand, to
> avoid the adjacent fields can interfere with each other when invalid
> arguments have been provided. The code looks more tidy at least with
> a mask and FIELD_PREP().
>
> Suggested-by: Marc Zyngier <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>

LGTM!

Reviewed-by: Ryan Roberts <[email protected]>

> ---
> arch/arm64/include/asm/tlbflush.h | 29 ++++++++++++++++++-----------
> 1 file changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index a75de2665d84..243d71f7bc1f 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -142,17 +142,24 @@ static inline unsigned long get_trans_granule(void)
> * EL1, Inner Shareable".
> *
> */
> -#define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \
> - ({ \
> - unsigned long __ta = (baddr); \
> - unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
> - __ta &= GENMASK_ULL(36, 0); \
> - __ta |= __ttl << 37; \
> - __ta |= (unsigned long)(num) << 39; \
> - __ta |= (unsigned long)(scale) << 44; \
> - __ta |= get_trans_granule() << 46; \
> - __ta |= (unsigned long)(asid) << 48; \
> - __ta; \
> +#define TLBIR_ASID_MASK GENMASK_ULL(63, 48)
> +#define TLBIR_TG_MASK GENMASK_ULL(47, 46)
> +#define TLBIR_SCALE_MASK GENMASK_ULL(45, 44)
> +#define TLBIR_NUM_MASK GENMASK_ULL(43, 39)
> +#define TLBIR_TTL_MASK GENMASK_ULL(38, 37)
> +#define TLBIR_BADDR_MASK GENMASK_ULL(36, 0)
> +
> +#define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \
> + ({ \
> + unsigned long __ta = 0; \
> + unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
> + __ta |= FIELD_PREP(TLBIR_BADDR_MASK, baddr); \
> + __ta |= FIELD_PREP(TLBIR_TTL_MASK, __ttl); \
> + __ta |= FIELD_PREP(TLBIR_NUM_MASK, num); \
> + __ta |= FIELD_PREP(TLBIR_SCALE_MASK, scale); \
> + __ta |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule()); \
> + __ta |= FIELD_PREP(TLBIR_ASID_MASK, asid); \
> + __ta; \
> })
>
> /* These macros are used by the TLBI RANGE feature. */


2024-04-08 08:44:02

by Ryan Roberts

Subject: Re: [PATCH v3 3/3] arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGES

On 05/04/2024 04:58, Gavin Shan wrote:
> MAX_TLBI_RANGE_PAGES pages is covered by SCALE#3 and NUM#31 and it's
> supported now. Allow TLBI RANGE operation when the number of pages is
> equal to MAX_TLBI_RANGE_PAGES in __flush_tlb_range_nosync().
>
> Suggested-by: Marc Zyngier <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>

Reviewed-by: Ryan Roberts <[email protected]>

> ---
> arch/arm64/include/asm/tlbflush.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 243d71f7bc1f..95fbc8c05607 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -446,11 +446,11 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> * When not uses TLB range ops, we can handle up to
> * (MAX_DVM_OPS - 1) pages;
> * When uses TLB range ops, we can handle up to
> - * (MAX_TLBI_RANGE_PAGES - 1) pages.
> + * MAX_TLBI_RANGE_PAGES pages.
> */
> if ((!system_supports_tlb_range() &&
> (end - start) >= (MAX_DVM_OPS * stride)) ||
> - pages >= MAX_TLBI_RANGE_PAGES) {
> + pages > MAX_TLBI_RANGE_PAGES) {

As a further enhancement, I wonder if it might be better to test:

pages * 4 / MAX_TLBI_RANGE_PAGES > MAX_DVM_OPS

Then add an extra loop over __flush_tlb_range_op(), like KVM does.

The math is trying to express that there are a maximum of 4 tlbi range
instructions for MAX_TLBI_RANGE_PAGES pages (1 per scale) and we only need to
fall back to flushing the whole mm if it could generate more than MAX_DVM_OPS ops.
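
Roughly, an untested sketch of the check I have in mind (the
!system_supports_tlb_range() leg stays as it is), followed by the extra
loop over __flush_tlb_range_op() mentioned above:

        if ((!system_supports_tlb_range() &&
             (end - start) >= (MAX_DVM_OPS * stride)) ||
            pages * 4 / MAX_TLBI_RANGE_PAGES > MAX_DVM_OPS) {
                flush_tlb_mm(vma->vm_mm);
                return;
        }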

> flush_tlb_mm(vma->vm_mm);
> return;
> }


2024-04-10 07:56:23

by Anshuman Khandual

Subject: Re: [PATCH v3 1/3] arm64: tlb: Fix TLBI RANGE operand



On 4/5/24 09:28, Gavin Shan wrote:
> KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
> pages are collected by VMM and the page table entries become write
> protected during live migration. Unfortunately, the operand passed
> to the TLBI RANGE instruction isn't correctly sorted out due to the
> commit 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()").
> It leads to crash on the destination VM after live migration because
> TLBs aren't flushed completely and some of the dirty pages are missed.
>
> For example, I have a VM where 8GB memory is assigned, starting from
> 0x40000000 (1GB). Note that the host has 4KB as the base page size.
> In the middile of migration, kvm_tlb_flush_vmid_range() is executed
> to flush TLBs. It passes MAX_TLBI_RANGE_PAGES as the argument to
> __kvm_tlb_flush_vmid_range() and __flush_s2_tlb_range_op(). SCALE#3
> and NUM#31, corresponding to MAX_TLBI_RANGE_PAGES, isn't supported
> by __TLBI_RANGE_NUM(). In this specific case, -1 has been returned
> from __TLBI_RANGE_NUM() for SCALE#3/2/1/0 and rejected by the loop
> in the __flush_tlb_range_op() until the variable @scale underflows
> and becomes -9, 0xffff708000040000 is set as the operand. The operand
> is wrong since it's sorted out by __TLBI_VADDR_RANGE() according to
> invalid @scale and @num.
>
> Fix it by extending __TLBI_RANGE_NUM() to support the combination of
> SCALE#3 and NUM#31. With the changes, [-1 31] instead of [-1 30] can
> be returned from the macro, meaning the TLBs for 0x200000 pages in the
> above example can be flushed in one shoot with SCALE#3 and NUM#31. The
> macro TLBI_RANGE_MASK is dropped since no one uses it any more. The
> comments are also adjusted accordingly.
>
> Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")
> Cc: [email protected] # v6.6+
> Reported-by: Yihuang Yu <[email protected]>
> Suggested-by: Marc Zyngier <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>

Reviewed-by: Anshuman Khandual <[email protected]>

> ---
> arch/arm64/include/asm/tlbflush.h | 20 +++++++++++---------
> 1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 3b0e8248e1a4..a75de2665d84 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -161,12 +161,18 @@ static inline unsigned long get_trans_granule(void)
> #define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
>
> /*
> - * Generate 'num' values from -1 to 30 with -1 rejected by the
> - * __flush_tlb_range() loop below.
> + * Generate 'num' values from -1 to 31 with -1 rejected by the
> + * __flush_tlb_range() loop below. Its return value is only
> + * significant for a maximum of MAX_TLBI_RANGE_PAGES pages. If
> + * 'pages' is more than that, you must iterate over the overall
> + * range.
> */
> -#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
> -#define __TLBI_RANGE_NUM(pages, scale) \
> - ((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
> +#define __TLBI_RANGE_NUM(pages, scale) \
> + ({ \
> + int __pages = min((pages), \
> + __TLBI_RANGE_PAGES(31, (scale))); \
> + (__pages >> (5 * (scale) + 1)) - 1; \
> + })
>
> /*
> * TLB Invalidation
> @@ -379,10 +385,6 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> * 3. If there is 1 page remaining, flush it through non-range operations. Range
> * operations can only span an even number of pages. We save this for last to
> * ensure 64KB start alignment is maintained for the LPA2 case.
> - *
> - * Note that certain ranges can be represented by either num = 31 and
> - * scale or num = 0 and scale + 1. The loop below favours the latter
> - * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
> */
> #define __flush_tlb_range_op(op, start, pages, stride, \
> asid, tlb_level, tlbi_user, lpa2) \

2024-04-10 07:58:08

by Anshuman Khandual

Subject: Re: [PATCH v3 2/3] arm64: tlb: Improve __TLBI_VADDR_RANGE()



On 4/5/24 09:28, Gavin Shan wrote:
> The macro returns the operand of TLBI RANGE instruction. A mask needs
> to be applied to each individual field upon producing the operand, to
> avoid the adjacent fields can interfere with each other when invalid
> arguments have been provided. The code looks more tidy at least with
> a mask and FIELD_PREP().
>
> Suggested-by: Marc Zyngier <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>

This looks much better.

Reviewed-by: Anshuman Khandual <[email protected]>

> ---
> arch/arm64/include/asm/tlbflush.h | 29 ++++++++++++++++++-----------
> 1 file changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index a75de2665d84..243d71f7bc1f 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -142,17 +142,24 @@ static inline unsigned long get_trans_granule(void)
> * EL1, Inner Shareable".
> *
> */
> -#define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \
> - ({ \
> - unsigned long __ta = (baddr); \
> - unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
> - __ta &= GENMASK_ULL(36, 0); \
> - __ta |= __ttl << 37; \
> - __ta |= (unsigned long)(num) << 39; \
> - __ta |= (unsigned long)(scale) << 44; \
> - __ta |= get_trans_granule() << 46; \
> - __ta |= (unsigned long)(asid) << 48; \
> - __ta; \
> +#define TLBIR_ASID_MASK GENMASK_ULL(63, 48)
> +#define TLBIR_TG_MASK GENMASK_ULL(47, 46)
> +#define TLBIR_SCALE_MASK GENMASK_ULL(45, 44)
> +#define TLBIR_NUM_MASK GENMASK_ULL(43, 39)
> +#define TLBIR_TTL_MASK GENMASK_ULL(38, 37)
> +#define TLBIR_BADDR_MASK GENMASK_ULL(36, 0)
> +
> +#define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \
> + ({ \
> + unsigned long __ta = 0; \
> + unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
> + __ta |= FIELD_PREP(TLBIR_BADDR_MASK, baddr); \
> + __ta |= FIELD_PREP(TLBIR_TTL_MASK, __ttl); \
> + __ta |= FIELD_PREP(TLBIR_NUM_MASK, num); \
> + __ta |= FIELD_PREP(TLBIR_SCALE_MASK, scale); \
> + __ta |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule()); \
> + __ta |= FIELD_PREP(TLBIR_ASID_MASK, asid); \
> + __ta; \
> })
>
> /* These macros are used by the TLBI RANGE feature. */

2024-04-10 07:58:51

by Anshuman Khandual

Subject: Re: [PATCH v3 3/3] arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGES

On 4/5/24 09:28, Gavin Shan wrote:
> MAX_TLBI_RANGE_PAGES pages is covered by SCALE#3 and NUM#31 and it's
> supported now. Allow TLBI RANGE operation when the number of pages is
> equal to MAX_TLBI_RANGE_PAGES in __flush_tlb_range_nosync().
>
> Suggested-by: Marc Zyngier <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>

Reviewed-by: Anshuman Khandual <[email protected]>

> ---
> arch/arm64/include/asm/tlbflush.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 243d71f7bc1f..95fbc8c05607 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -446,11 +446,11 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> * When not uses TLB range ops, we can handle up to
> * (MAX_DVM_OPS - 1) pages;
> * When uses TLB range ops, we can handle up to
> - * (MAX_TLBI_RANGE_PAGES - 1) pages.
> + * MAX_TLBI_RANGE_PAGES pages.
> */
> if ((!system_supports_tlb_range() &&
> (end - start) >= (MAX_DVM_OPS * stride)) ||
> - pages >= MAX_TLBI_RANGE_PAGES) {
> + pages > MAX_TLBI_RANGE_PAGES) {
> flush_tlb_mm(vma->vm_mm);
> return;
> }

2024-04-10 08:45:10

by Shaoqin Huang

Subject: Re: [PATCH v3 0/3] arm64: tlb: Fix TLBI RANGE operand



On 4/5/24 11:58, Gavin Shan wrote:
> A kernel crash on the destination VM after the live migration was
> reported by Yihuang Yu. The issue is only reproducible on NVidia's
> grace-hopper where TLBI RANGE feature is available. The kernel crash
> is caused by incomplete TLB flush and missed dirty page. For the
> root cause and analysis, please refer to PATCH[v3 1/3]'s commit log.
>
> Thanks to Marc Zyngier who proposed all the code changes.
>
> PATCH[1] fixes the kernel crash by extending __TLBI_RANGE_NUM() so that
> the TLBI RANGE on the area with MAX_TLBI_RANGE_PAGES pages can
> be supported
> PATCH[2] improves __TLBI_VADDR_RANGE() with masks and FIELD_PREP()
> PATCH[3] allows TLBI RANGE operation on the area with MAX_TLBI_RANGE_PAGES
> pages in __flush_tlb_range_nosync()
>
> v2: https://lists.infradead.org/pipermail/linux-arm-kernel/2024-April/917432.html
> v1: https://lists.infradead.org/pipermail/linux-arm-kernel/2024-April/916972.html
>
> Changelog
> =========
> v3:
> Improve __TLBI_RANGE_NUM() and its comments. Added patches
> to improve __TLBI_VADDR_RANGE() and __flush_tlb_range_nosync() (Marc)
> v2:
> Improve __TLBI_RANGE_NUM() (Marc)
>
> Gavin Shan (3):
> arm64: tlb: Fix TLBI RANGE operand
> arm64: tlb: Improve __TLBI_VADDR_RANGE()
> arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGES
>
> arch/arm64/include/asm/tlbflush.h | 53 ++++++++++++++++++-------------
> 1 file changed, 31 insertions(+), 22 deletions(-)
>

For the series.

Reviewed-by: Shaoqin Huang <[email protected]>

--
Shaoqin


2024-04-10 08:46:42

by Marc Zyngier

Subject: Re: [PATCH v3 1/3] arm64: tlb: Fix TLBI RANGE operand

On Mon, 08 Apr 2024 09:29:31 +0100,
Ryan Roberts <[email protected]> wrote:
>
> On 05/04/2024 04:58, Gavin Shan wrote:
> > KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
> > pages are collected by VMM and the page table entries become write
> > protected during live migration. Unfortunately, the operand passed
> > to the TLBI RANGE instruction isn't correctly sorted out due to the
> > commit 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()").
> > It leads to crash on the destination VM after live migration because
> > TLBs aren't flushed completely and some of the dirty pages are missed.
> >
> > For example, I have a VM where 8GB memory is assigned, starting from
> > 0x40000000 (1GB). Note that the host has 4KB as the base page size.
> > In the middile of migration, kvm_tlb_flush_vmid_range() is executed
> > to flush TLBs. It passes MAX_TLBI_RANGE_PAGES as the argument to
> > __kvm_tlb_flush_vmid_range() and __flush_s2_tlb_range_op(). SCALE#3
> > and NUM#31, corresponding to MAX_TLBI_RANGE_PAGES, isn't supported
> > by __TLBI_RANGE_NUM(). In this specific case, -1 has been returned
> > from __TLBI_RANGE_NUM() for SCALE#3/2/1/0 and rejected by the loop
> > in the __flush_tlb_range_op() until the variable @scale underflows
> > and becomes -9, 0xffff708000040000 is set as the operand. The operand
> > is wrong since it's sorted out by __TLBI_VADDR_RANGE() according to
> > invalid @scale and @num.
> >
> > Fix it by extending __TLBI_RANGE_NUM() to support the combination of
> > SCALE#3 and NUM#31. With the changes, [-1 31] instead of [-1 30] can
> > be returned from the macro, meaning the TLBs for 0x200000 pages in the
> > above example can be flushed in one shoot with SCALE#3 and NUM#31. The
> > macro TLBI_RANGE_MASK is dropped since no one uses it any more. The
> > comments are also adjusted accordingly.
>
> Perhaps I'm being overly pedantic, but I don't think the bug is
> __TLBI_RANGE_NUM() not being able to return 31; It is clearly documented that it
> can only return in the range [-1, 30] and a maximum of (MAX_TLBI_RANGE_PAGES -
> 1) pages are supported.

I guess "clearly" is pretty relative. I find it misleading that we
don't support the full range of what the architecture offers and have
these odd limitations.

> The bug is in the kvm caller, which tries to call __flush_tlb_range_op() with
> MAX_TLBI_RANGE_PAGES; clearly out-of-bounds.

Nobody disputes that point, hence the Fixes: tag pointing to the KVM
patch. But there are two ways to fix it: either reduce the amount KVM
can use for range invalidation, or fix the helper to allow the full
range offered by the architecture.

> So personally, I would prefer to fix the bug first. Then separately
> enhance the infrastructure to support NUM=31.

I don't think this buys us much, apart from making it harder for
people to know what they need/want/randomly-elect to backport.

> > Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")
>
> I would argue that the bug was actually introduced by commit 360839027a6e
> ("arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range"), which
> separated the tlbi loop from the range size validation in __flush_tlb_range().
> Before this, all calls would have to go through __flush_tlb_range() and
> therefore anything bigger than (MAX_TLBI_RANGE_PAGES - 1) pages would cause the
> whole mm to be flushed. Although I get that bisect will lead to this one, so
> that's probably the right one to highlight.

I haven't tried to bisect it, only picked this as the obvious
culprit.

To your point, using __flush_tlb_range() made little sense for KVM --
what would be the vma here? Splitting the helper out was, I think, the
correct decision. But we of course lost sight of the __TLBI_RANGE_NUM
limitation in the process.

> I get why it was split, but perhaps it should have been split at a higher level;
> If tlbi range is not supported, then KVM will flush the whole vmid. Would it be
> better for KVM to follow the same pattern as __flush_tlb_range_nosync() and
> issue per-block tlbis upto a max of MAX_DVM_OPS before falling back to the whole
> vmid? And if tlbi range is supported, KVM uses it regardless of the size of the
> range, whereas __flush_tlb_range_nosync() falls back to flush_tlb_mm() at a
> certain size. It's not clear why this divergence is useful?

That'd be a definitive improvement indeed, and would bring back some
much needed consistency.

> > Cc: [email protected] # v6.6+
> > Reported-by: Yihuang Yu <[email protected]>
> > Suggested-by: Marc Zyngier <[email protected]>
> > Signed-off-by: Gavin Shan <[email protected]>
>
> Anyway, the implementation looks correct, so:
>
> Reviewed-by: Ryan Roberts <[email protected]>

Thanks for that!

M.

--
Without deviation from the norm, progress is not possible.

2024-04-10 08:50:31

by Marc Zyngier

Subject: Re: [PATCH v3 3/3] arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGES

On Mon, 08 Apr 2024 09:43:44 +0100,
Ryan Roberts <[email protected]> wrote:
>
> On 05/04/2024 04:58, Gavin Shan wrote:
> > MAX_TLBI_RANGE_PAGES pages is covered by SCALE#3 and NUM#31 and it's
> > supported now. Allow TLBI RANGE operation when the number of pages is
> > equal to MAX_TLBI_RANGE_PAGES in __flush_tlb_range_nosync().
> >
> > Suggested-by: Marc Zyngier <[email protected]>
> > Signed-off-by: Gavin Shan <[email protected]>
>
> Reviewed-by: Ryan Roberts <[email protected]>
>
> > ---
> > arch/arm64/include/asm/tlbflush.h | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > index 243d71f7bc1f..95fbc8c05607 100644
> > --- a/arch/arm64/include/asm/tlbflush.h
> > +++ b/arch/arm64/include/asm/tlbflush.h
> > @@ -446,11 +446,11 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> > * When not uses TLB range ops, we can handle up to
> > * (MAX_DVM_OPS - 1) pages;
> > * When uses TLB range ops, we can handle up to
> > - * (MAX_TLBI_RANGE_PAGES - 1) pages.
> > + * MAX_TLBI_RANGE_PAGES pages.
> > */
> > if ((!system_supports_tlb_range() &&
> > (end - start) >= (MAX_DVM_OPS * stride)) ||
> > - pages >= MAX_TLBI_RANGE_PAGES) {
> > + pages > MAX_TLBI_RANGE_PAGES) {
>
> As a further enhancement, I wonder if it might be better to test:
>
> pages * 4 / MAX_TLBI_RANGE_PAGES > MAX_DVM_OPS
>
> Then add an extra loop over __flush_tlb_range_op(), like KVM does.
>
> The math is trying to express that there are a maximum of 4 tlbi range
> instructions for MAX_TLBI_RANGE_PAGES pages (1 per scale) and we only need to
> fall back to flushing the whole mm if it could generate more than MAX_DVM_OPS ops.

That'd be a good enhancement indeed, although I wonder if that occurs
as often as we see it on the KVM side. But in any case, adding
consistency amongst the users of __flush_tlb_range_op() can only be
beneficial.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2024-04-10 17:52:40

by Catalin Marinas

Subject: Re: (subset) [PATCH v3 0/3] arm64: tlb: Fix TLBI RANGE operand

On Fri, 05 Apr 2024 13:58:49 +1000, Gavin Shan wrote:
> A kernel crash on the destination VM after the live migration was
> reported by Yihuang Yu. The issue is only reproducible on NVidia's
> grace-hopper where TLBI RANGE feature is available. The kernel crash
> is caused by incomplete TLB flush and missed dirty page. For the
> root cause and analysis, please refer to PATCH[v3 1/3]'s commit log.
>
> Thanks to Marc Zyngier who proposed all the code changes.
>
> [...]

Applied to arm64 (for-next/fixes), thanks!

[1/3] arm64: tlb: Fix TLBI RANGE operand
https://git.kernel.org/arm64/c/e3ba51ab24fd

--
Catalin


2024-04-11 10:10:40

by Ryan Roberts

Subject: Re: [PATCH v3 1/3] arm64: tlb: Fix TLBI RANGE operand

On 10/04/2024 09:45, Marc Zyngier wrote:
> On Mon, 08 Apr 2024 09:29:31 +0100,
> Ryan Roberts <[email protected]> wrote:
>>
>> On 05/04/2024 04:58, Gavin Shan wrote:
>>> KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
>>> pages are collected by VMM and the page table entries become write
>>> protected during live migration. Unfortunately, the operand passed
>>> to the TLBI RANGE instruction isn't correctly sorted out due to the
>>> commit 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()").
>>> It leads to crash on the destination VM after live migration because
>>> TLBs aren't flushed completely and some of the dirty pages are missed.
>>>
>>> For example, I have a VM where 8GB memory is assigned, starting from
>>> 0x40000000 (1GB). Note that the host has 4KB as the base page size.
>>> In the middile of migration, kvm_tlb_flush_vmid_range() is executed
>>> to flush TLBs. It passes MAX_TLBI_RANGE_PAGES as the argument to
>>> __kvm_tlb_flush_vmid_range() and __flush_s2_tlb_range_op(). SCALE#3
>>> and NUM#31, corresponding to MAX_TLBI_RANGE_PAGES, isn't supported
>>> by __TLBI_RANGE_NUM(). In this specific case, -1 has been returned
>>> from __TLBI_RANGE_NUM() for SCALE#3/2/1/0 and rejected by the loop
>>> in the __flush_tlb_range_op() until the variable @scale underflows
>>> and becomes -9, 0xffff708000040000 is set as the operand. The operand
>>> is wrong since it's sorted out by __TLBI_VADDR_RANGE() according to
>>> invalid @scale and @num.
>>>
>>> Fix it by extending __TLBI_RANGE_NUM() to support the combination of
>>> SCALE#3 and NUM#31. With the changes, [-1 31] instead of [-1 30] can
>>> be returned from the macro, meaning the TLBs for 0x200000 pages in the
>>> above example can be flushed in one shoot with SCALE#3 and NUM#31. The
>>> macro TLBI_RANGE_MASK is dropped since no one uses it any more. The
>>> comments are also adjusted accordingly.
>>
>> Perhaps I'm being overly pedantic, but I don't think the bug is
>> __TLBI_RANGE_NUM() not being able to return 31; It is clearly documented that it
>> can only return in the range [-1, 30] and a maximum of (MAX_TLBI_RANGE_PAGES -
>> 1) pages are supported.
>
> I guess "clearly" is pretty relative. I find it misleading that we
> don't support the full range of what the architecture offers and have
> these odd limitations.
>
>> The bug is in the kvm caller, which tries to call __flush_tlb_range_op() with
>> MAX_TLBI_RANGE_PAGES; clearly out-of-bounds.
>
> Nobody disputes that point, hence the Fixes: tag pointing to the KVM
> patch. But there are two ways to fix it: either reduce the amount KVM
> can use for range invalidation, or fix the helper to allow the full
> range offered by the architecture.
>
>> So personally, I would prefer to fix the bug first. Then separately
>> enhance the infrastructure to support NUM=31.
>
> I don't think this buys us much, apart from making it harder for
> people to know what they need/want/randomly-elect to backport.

Yeah fair enough. Like I said, I'm being pedantic. And the final state
(supporting NUM=31) is clearly the right outcome. So as long as there is no risk
that backporting the enhancement triggers other bugs, then this is fine for me.

>
>>> Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")
>>
>> I would argue that the bug was actually introduced by commit 360839027a6e
>> ("arm64: tlb: Refactor the core flush algorithm of __flush_tlb_range"), which
>> separated the tlbi loop from the range size validation in __flush_tlb_range().
>> Before this, all calls would have to go through __flush_tlb_range() and
>> therefore anything bigger than (MAX_TLBI_RANGE_PAGES - 1) pages would cause the
>> whole mm to be flushed. Although I get that bisect will lead to this one, so
>> that's probably the right one to highlight.
>
> I haven't tried to bisect it, only picked this as the obviously
> culprit.
>
> To your point, using __flush_tlb_range() made little sense for KVM --
> what would be the vma here? Splitting the helper out was, I think the
> correct decision. But we of course lost sight of the __TLBI_RANGE_NUM
> limitation in the process.

Indeed. I'm just wondering whether it could have been factored out so that the
size check and the decision about full invalidate vs per-va/range invalidate
remained in the common code.

I'm not volunteering to make that change though. So I guess what we have now is
good enough for the time being.

Thanks,
Ryan

>
>> I get why it was split, but perhaps it should have been split at a higher level;
>> If tlbi range is not supported, then KVM will flush the whole vmid. Would it be
>> better for KVM to follow the same pattern as __flush_tlb_range_nosync() and
>> issue per-block tlbis upto a max of MAX_DVM_OPS before falling back to the whole
>> vmid? And if tlbi range is supported, KVM uses it regardless of the size of the
>> range, whereas __flush_tlb_range_nosync() falls back to flush_tlb_mm() at a
>> certain size. It's not clear why this divergence is useful?
>
> That'd be a definitive improvement indeed, and would bring back some
> much needed consistency.
>
>>> Cc: [email protected] # v6.6+
>>> Reported-by: Yihuang Yu <[email protected]>
>>> Suggested-by: Marc Zyngier <[email protected]>
>>> Signed-off-by: Gavin Shan <[email protected]>
>>
>> Anyway, the implementation looks correct, so:
>>
>> Reviewed-by: Ryan Roberts <[email protected]>
>
> Thanks for that!
>
> M.
>


2024-04-11 10:47:03

by Will Deacon

Subject: Re: [PATCH v3 3/3] arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGES

On Wed, Apr 10, 2024 at 09:50:20AM +0100, Marc Zyngier wrote:
> On Mon, 08 Apr 2024 09:43:44 +0100,
> Ryan Roberts <[email protected]> wrote:
> >
> > On 05/04/2024 04:58, Gavin Shan wrote:
> > > MAX_TLBI_RANGE_PAGES pages is covered by SCALE#3 and NUM#31 and it's
> > > supported now. Allow TLBI RANGE operation when the number of pages is
> > > equal to MAX_TLBI_RANGE_PAGES in __flush_tlb_range_nosync().
> > >
> > > Suggested-by: Marc Zyngier <[email protected]>
> > > Signed-off-by: Gavin Shan <[email protected]>
> >
> > Reviewed-by: Ryan Roberts <[email protected]>
> >
> > > ---
> > > arch/arm64/include/asm/tlbflush.h | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> > > index 243d71f7bc1f..95fbc8c05607 100644
> > > --- a/arch/arm64/include/asm/tlbflush.h
> > > +++ b/arch/arm64/include/asm/tlbflush.h
> > > @@ -446,11 +446,11 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> > > * When not uses TLB range ops, we can handle up to
> > > * (MAX_DVM_OPS - 1) pages;
> > > * When uses TLB range ops, we can handle up to
> > > - * (MAX_TLBI_RANGE_PAGES - 1) pages.
> > > + * MAX_TLBI_RANGE_PAGES pages.
> > > */
> > > if ((!system_supports_tlb_range() &&
> > > (end - start) >= (MAX_DVM_OPS * stride)) ||
> > > - pages >= MAX_TLBI_RANGE_PAGES) {
> > > + pages > MAX_TLBI_RANGE_PAGES) {
> >
> > As a further enhancement, I wonder if it might be better to test:
> >
> > pages * 4 / MAX_TLBI_RANGE_PAGES > MAX_DVM_OPS
> >
> > Then add an extra loop over __flush_tlb_range_op(), like KVM does.
> >
> > The math is trying to express that there are a maximum of 4 tlbi range
> > instructions for MAX_TLBI_RANGE_PAGES pages (1 per scale) and we only need to
> > fall back to flushing the whole mm if it could generate more than MAX_DVM_OPS ops.
>
> That'd be a good enhancement indeed, although I wonder if that occurs
> as often as we see it on the KVM side. But in any case, adding
> consistency amongst the users of __flush_tlb_range_op() can only be
> beneficial.

I'll pick patches 2 & 3 up for 6.10, but feel free to send stuff on top
if you want to tweak this.

Will

2024-04-12 16:18:27

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] arm64: tlb: Fix TLBI RANGE operand

On Fri, 05 Apr 2024 13:58:49 +1000, Gavin Shan wrote:
> A kernel crash on the destination VM after the live migration was
> reported by Yihuang Yu. The issue is only reproducible on NVidia's
> grace-hopper where TLBI RANGE feature is available. The kernel crash
> is caused by incomplete TLB flush and missed dirty page. For the
> root cause and analysis, please refer to PATCH[v3 1/3]'s commit log.
>
> Thanks to Marc Zyngier who proposed all the code changes.
>
> [...]

Applied to arm64 (for-next/fixes), thanks!

[1/3] arm64: tlb: Fix TLBI RANGE operand
(no commit info)
[2/3] arm64: tlb: Improve __TLBI_VADDR_RANGE()
https://git.kernel.org/arm64/c/e07255d69702
[3/3] arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGES
https://git.kernel.org/arm64/c/73301e464a72

Cheers,
--
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev