2022-11-11 08:11:46

by Guo Ren

Subject: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

From: Guo Ren <[email protected]>

After use_asid_allocator is enabled, userspace applications can crash
due to stale TLB entries, because using cpumask_clear_cpu alone, without
local_flush_tlb_all, cannot guarantee that a CPU's TLB entries are fresh.
As a result, set_mm_asid lets a user space application read a stale
value through a stale TLB entry; set_mm_noasid is unaffected.

Here is the symptom of the bug:
unhandled signal 11 code 0x1 (coredump)
0x0000003fd6d22524 <+4>: auipc s0,0x70
0x0000003fd6d22528 <+8>: ld s0,-148(s0) # 0x3fd6d92490
=> 0x0000003fd6d2252c <+12>: ld a5,0(s0)
(gdb) i r s0
s0 0x8082ed1cc3198b21 0x8082ed1cc3198b21
(gdb) x /2x 0x3fd6d92490
0x3fd6d92490: 0xd80ac8a8 0x0000003f
The core dump shows that register s0 is wrong while the value in memory
is correct: 'ld s0, -148(s0)' used a stale mapping entry in the TLB and
loaded a wrong result from an incorrect physical address.

When the task ran on CPU0, it loaded (or speculatively loaded) the value
at address 0x3fd6d92490, so the first version of the mapping entry was
walked (PTW) into CPU0's TLB.
When the task switched from CPU0 to CPU1 (no local_tlb_flush_all here,
because of asid), it happened to write a value to that address
(0x3fd6d92490). That caused do_page_fault -> wp_page_copy ->
ptep_clear_flush -> ptep_get_and_clear & flush_tlb_page.
flush_tlb_page used mm_cpumask(mm) to determine which CPUs needed a TLB
flush, but CPU0 had already been cleared from the mm_cpumask in the
previous switch_mm. So only CPU1's TLB was flushed before the second
version of the PTE mapping was set. When the task switched from CPU1
back to CPU0, CPU0 still used a stale TLB mapping entry containing a
wrong target physical address. The bug surfaced when the task happened
to read that value.

CPU0 CPU1
- switch 'task' in
- read addr (Fill stale mapping
entry into TLB)
- switch 'task' out (no tlb_flush)
- switch 'task' in (no tlb_flush)
- write addr cause pagefault
do_page_fault() (change to
new addr mapping)
wp_page_copy()
ptep_clear_flush()
ptep_get_and_clear()
& flush_tlb_page()
write new value into addr
- switch 'task' out (no tlb_flush)
- switch 'task' in (no tlb_flush)
- read addr again (Use stale
mapping entry in TLB)
get wrong value from old physical
addr, BUG!

The solution is to keep all CPUs' footprints in mm_cpumask across
switch_mm, which guarantees that all stale TLB entries are invalidated
during a TLB flush.

Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Cc: Anup Patel <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
---
Changes in v3:
- Move set/clear cpumask(mm) into set_mm (Make code more pretty
with Andrew's advice)
- Optimize comment description

Changes in v2:
- Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
test robot <[email protected]>)
- Keep cpumask_clear_cpu for noasid
---
arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 7acbfbd14557..0f784e3d307b 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -205,12 +205,24 @@ static void set_mm_noasid(struct mm_struct *mm)
local_flush_tlb_all();
}

-static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
+static inline void set_mm(struct mm_struct *prev,
+ struct mm_struct *next, unsigned int cpu)
{
- if (static_branch_unlikely(&use_asid_allocator))
- set_mm_asid(mm, cpu);
- else
- set_mm_noasid(mm);
+ /*
+ * The mm_cpumask indicates which harts' TLBs contain the virtual
+ * address mapping of the mm. Compared to noasid, using asid
+ * can't guarantee that stale TLB entries are invalidated because
+ * the asid mechanism wouldn't flush TLB for every switch_mm for
+ * performance. So when using asid, keep all CPUs footmarks in
+ * cpumask() until mm reset.
+ */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+ if (static_branch_unlikely(&use_asid_allocator)) {
+ set_mm_asid(next, cpu);
+ } else {
+ cpumask_clear_cpu(cpu, mm_cpumask(prev));
+ set_mm_noasid(next);
+ }
}

static int __init asids_init(void)
@@ -264,7 +276,8 @@ static int __init asids_init(void)
}
early_initcall(asids_init);
#else
-static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
+static inline void set_mm(struct mm_struct *prev,
+ struct mm_struct *next, unsigned int cpu)
{
/* Nothing to do here when there is no MMU */
}
@@ -317,10 +330,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
*/
cpu = smp_processor_id();

- cpumask_clear_cpu(cpu, mm_cpumask(prev));
- cpumask_set_cpu(cpu, mm_cpumask(next));
-
- set_mm(next, cpu);
+ set_mm(prev, next, cpu);

flush_icache_deferred(next, cpu);
}
--
2.36.1



2022-11-11 08:59:23

by Andrew Jones

Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

On Fri, Nov 11, 2022 at 02:59:02AM -0500, [email protected] wrote:
> From: Guo Ren <[email protected]>
>
> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.
>
> Here is the symptom of the bug:
> unhandled signal 11 code 0x1 (coredump)
> 0x0000003fd6d22524 <+4>: auipc s0,0x70
> 0x0000003fd6d22528 <+8>: ld s0,-148(s0) # 0x3fd6d92490
> => 0x0000003fd6d2252c <+12>: ld a5,0(s0)
> (gdb) i r s0
> s0 0x8082ed1cc3198b21 0x8082ed1cc3198b21
> (gdb) x /2x 0x3fd6d92490
> 0x3fd6d92490: 0xd80ac8a8 0x0000003f
> The core dump file shows that register s0 is wrong, but the value in
> memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> in TLB and got a wrong result from an incorrect physical address.
>
> When the task ran on CPU0, which loaded/speculative-loaded the value of
> address(0x3fd6d92490), then the first version of the mapping entry was
> PTWed into CPU0's TLB.
> When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> asid), it happened to write a value on the address (0x3fd6d92490). It
> caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> ptep_get_and_clear & flush_tlb_page.
> The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> switch_mm. So we only flushed the CPU1 TLB and set the second version
> mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> still used a stale TLB mapping entry which contained a wrong target
> physical address. It raised a bug when the task happened to read that
> value.
>
> CPU0 CPU1
> - switch 'task' in
> - read addr (Fill stale mapping
> entry into TLB)
> - switch 'task' out (no tlb_flush)
> - switch 'task' in (no tlb_flush)
> - write addr cause pagefault
> do_page_fault() (change to
> new addr mapping)
> wp_page_copy()
> ptep_clear_flush()
> ptep_get_and_clear()
> & flush_tlb_page()
> write new value into addr
> - switch 'task' out (no tlb_flush)
> - switch 'task' in (no tlb_flush)
> - read addr again (Use stale
> mapping entry in TLB)
> get wrong value from old phyical
> addr, BUG!
>
> The solution is to keep all CPUs' footmarks of cpumask(mm) in switch_mm,
> which could guarantee to invalidate all stale TLB entries during TLB
> flush.
>
> Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> Signed-off-by: Guo Ren <[email protected]>
> Signed-off-by: Guo Ren <[email protected]>
> Cc: Anup Patel <[email protected]>
> Cc: Palmer Dabbelt <[email protected]>
> ---
> Changes in v3:
> - Move set/clear cpumask(mm) into set_mm (Make code more pretty
> with Andrew's advice)
> - Optimize comment description
>
> Changes in v2:
> - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
> test robot <[email protected]>)
> - Keep cpumask_clear_cpu for noasid
> ---
> arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
> 1 file changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 7acbfbd14557..0f784e3d307b 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -205,12 +205,24 @@ static void set_mm_noasid(struct mm_struct *mm)
> local_flush_tlb_all();
> }
>
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> + struct mm_struct *next, unsigned int cpu)
> {
> - if (static_branch_unlikely(&use_asid_allocator))
> - set_mm_asid(mm, cpu);
> - else
> - set_mm_noasid(mm);
> + /*
> + * The mm_cpumask indicates which harts' TLBs contain the virtual
> + * address mapping of the mm. Compared to noasid, using asid
> + * can't guarantee that stale TLB entries are invalidated because
> + * the asid mechanism wouldn't flush TLB for every switch_mm for
> + * performance. So when using asid, keep all CPUs footmarks in
> + * cpumask() until mm reset.
> + */
> + cpumask_set_cpu(cpu, mm_cpumask(next));
> + if (static_branch_unlikely(&use_asid_allocator)) {
> + set_mm_asid(next, cpu);
> + } else {
> + cpumask_clear_cpu(cpu, mm_cpumask(prev));
> + set_mm_noasid(next);
> + }
> }
>
> static int __init asids_init(void)
> @@ -264,7 +276,8 @@ static int __init asids_init(void)
> }
> early_initcall(asids_init);
> #else
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> + struct mm_struct *next, unsigned int cpu)
> {
> /* Nothing to do here when there is no MMU */
> }
> @@ -317,10 +330,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> */
> cpu = smp_processor_id();
>
> - cpumask_clear_cpu(cpu, mm_cpumask(prev));
> - cpumask_set_cpu(cpu, mm_cpumask(next));
> -
> - set_mm(next, cpu);
> + set_mm(prev, next, cpu);
>
> flush_icache_deferred(next, cpu);
> }
> --
> 2.36.1
>

Reviewed-by: Andrew Jones <[email protected]>

2022-11-18 22:04:08

by Sergey Matyukevich

Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

Hi Guo Ren,


> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.

... [snip]

> + /*
> + * The mm_cpumask indicates which harts' TLBs contain the virtual
> + * address mapping of the mm. Compared to noasid, using asid
> + * can't guarantee that stale TLB entries are invalidated because
> + * the asid mechanism wouldn't flush TLB for every switch_mm for
> + * performance. So when using asid, keep all CPUs footmarks in
> + * cpumask() until mm reset.
> + */
> + cpumask_set_cpu(cpu, mm_cpumask(next));
> + if (static_branch_unlikely(&use_asid_allocator)) {
> + set_mm_asid(next, cpu);
> + } else {
> + cpumask_clear_cpu(cpu, mm_cpumask(prev));
> + set_mm_noasid(next);
> + }
> }

I observe similar user-space crashes on my SMP systems with enabled ASID.
My attempt to fix the issue was a bit different, see the following patch:

https://lore.kernel.org/linux-riscv/[email protected]/

In brief, the idea was borrowed from flush_icache_mm handling:
- keep track of CPUs not running the task
- perform per-ASID TLB flush on such CPUs only if the task is switched there

Your patch also works fine in my tests, fixing those crashes. I have a
question though, regarding the removed cpumask_clear_cpu. How are CPUs
that are no longer running the task removed from its mm_cpumask? If they
are not removed, then flush_tlb_mm/flush_tlb_page will broadcast
unnecessary TLB flushes to those CPUs when ASID is enabled.

Regards,
Sergey

2022-11-19 03:56:30

by Guo Ren

Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

On Sat, Nov 19, 2022 at 4:57 AM Sergey Matyukevich <[email protected]> wrote:
>
> Hi Guo Ren,
>
>
> > After use_asid_allocator is enabled, the userspace application will
> > crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > Then set_mm_asid would cause the user space application to get a stale
> > value by stale TLB entry, but set_mm_noasid is okay.
>
> ... [snip]
>
> > + /*
> > + * The mm_cpumask indicates which harts' TLBs contain the virtual
> > + * address mapping of the mm. Compared to noasid, using asid
> > + * can't guarantee that stale TLB entries are invalidated because
> > + * the asid mechanism wouldn't flush TLB for every switch_mm for
> > + * performance. So when using asid, keep all CPUs footmarks in
> > + * cpumask() until mm reset.
> > + */
> > + cpumask_set_cpu(cpu, mm_cpumask(next));
> > + if (static_branch_unlikely(&use_asid_allocator)) {
> > + set_mm_asid(next, cpu);
> > + } else {
> > + cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > + set_mm_noasid(next);
> > + }
> > }
>
> I observe similar user-space crashes on my SMP systems with enabled ASID.
> My attempt to fix the issue was a bit different, see the following patch:
>
> https://lore.kernel.org/linux-riscv/[email protected]/
>
> In brief, the idea was borrowed from flush_icache_mm handling:
> - keep track of CPUs not running the task
> - perform per-ASID TLB flush on such CPUs only if the task is switched there
>
> Your patch also works fine in my tests fixing those crashes. I have a
> question though, regarding removed cpumask_clear_cpu. How CPUs no more
> running the task are removed from its mm_cpumask ? If they are not
> removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> TLB flushes to those CPUs when ASID is enabled.
A task can be migrated to any CPU by the scheduler. So keeping TLB
contents synced with cpumask_set/clear needs an additional tlb_flush,
just like noasid, and your patch still follows that style. The worth of
ASID is avoiding tlb_flush during the context switch. Yes, my patch
would increase some tlb_flush IPI costs. But when the mapping is stable,
no tlb_flush is needed during switch_mm (hackbench would be a
beneficiary because no more TLB flushes are needed on its hot
path). Here are my points:
- We copied the arm64 globally unique asid mechanism into riscv,
which depends on hardware broadcast TLB flush. My fixup patch is
closer to that original design, proven in the arm64 world.
- If riscv keeps the local TLB flush hw design in the ISA spec, please
try x86's per-CPU array of ASIDs. But that is a significant change;
let's fix the current issue with the smallest patch first.

In the end, thanks for your review and testing.
--
Best Regards
Guo Ren

2022-11-21 20:40:43

by Sergey Matyukevich

Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

Hi Guo Ren,

> > > After use_asid_allocator is enabled, the userspace application will
> > > crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > > local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > > Then set_mm_asid would cause the user space application to get a stale
> > > value by stale TLB entry, but set_mm_noasid is okay.
> >
> > ... [snip]
> >
> > > + /*
> > > + * The mm_cpumask indicates which harts' TLBs contain the virtual
> > > + * address mapping of the mm. Compared to noasid, using asid
> > > + * can't guarantee that stale TLB entries are invalidated because
> > > + * the asid mechanism wouldn't flush TLB for every switch_mm for
> > > + * performance. So when using asid, keep all CPUs footmarks in
> > > + * cpumask() until mm reset.
> > > + */
> > > + cpumask_set_cpu(cpu, mm_cpumask(next));
> > > + if (static_branch_unlikely(&use_asid_allocator)) {
> > > + set_mm_asid(next, cpu);
> > > + } else {
> > > + cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > > + set_mm_noasid(next);
> > > + }
> > > }
> >
> > I observe similar user-space crashes on my SMP systems with enabled ASID.
> > My attempt to fix the issue was a bit different, see the following patch:
> >
> > https://lore.kernel.org/linux-riscv/[email protected]/
> >
> > In brief, the idea was borrowed from flush_icache_mm handling:
> > - keep track of CPUs not running the task
> > - perform per-ASID TLB flush on such CPUs only if the task is switched there
> >
> > Your patch also works fine in my tests fixing those crashes. I have a
> > question though, regarding removed cpumask_clear_cpu. How CPUs no more
> > running the task are removed from its mm_cpumask ? If they are not
> > removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> > TLB flushes to those CPUs when ASID is enabled.
> A task would be migrated to any CPU by the scheduler. So keeping TLB
> contents synced with cpumask_set/clear needs additional tlb_flush just
> like noasid, and your patch still follows that style. The worth of
> ASID is avoiding tlb_flush during the context switch. Yes, my patch
> would increase some tlb_flush IPI costs. But when mapping is stable,
> no tlb_flush is needed during the switch_mm (Hackbench would be
> beneficiary because no more TLB flush is needed at his hot point
> path). Here are my points:
> - We copied the arm64 globally unique asid mechanism into riscv,
> which depends on hardware broadcast TLB flush. My fixup patch is
> closer to the original principle design, proven in the arm64 world.
> - If riscv continues local TLB flush hw design in ISA spec, please
> try x86's per-CPU array of ASID. But that is a significant change;
> let's fix the current issue with the smallest patch first.
>
> In the end, thx your review and test.

By the way, how did you verify the patch? Do you have a good
reproducer for this issue?

Regards,
Sergey

2022-11-21 21:09:51

by Sergey Matyukevich

Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.
>
> Here is the symptom of the bug:
> unhandled signal 11 code 0x1 (coredump)
> 0x0000003fd6d22524 <+4>: auipc s0,0x70
> 0x0000003fd6d22528 <+8>: ld s0,-148(s0) # 0x3fd6d92490
> => 0x0000003fd6d2252c <+12>: ld a5,0(s0)
> (gdb) i r s0
> s0 0x8082ed1cc3198b21 0x8082ed1cc3198b21
> (gdb) x /2x 0x3fd6d92490
> 0x3fd6d92490: 0xd80ac8a8 0x0000003f
> The core dump file shows that register s0 is wrong, but the value in
> memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> in TLB and got a wrong result from an incorrect physical address.
>
> When the task ran on CPU0, which loaded/speculative-loaded the value of
> address(0x3fd6d92490), then the first version of the mapping entry was
> PTWed into CPU0's TLB.
> When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> asid), it happened to write a value on the address (0x3fd6d92490). It
> caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> ptep_get_and_clear & flush_tlb_page.
> The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> switch_mm. So we only flushed the CPU1 TLB and set the second version
> mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> still used a stale TLB mapping entry which contained a wrong target
> physical address. It raised a bug when the task happened to read that
> value.
>
> CPU0 CPU1
> - switch 'task' in
> - read addr (Fill stale mapping
> entry into TLB)
> - switch 'task' out (no tlb_flush)
> - switch 'task' in (no tlb_flush)
> - write addr cause pagefault
> do_page_fault() (change to
> new addr mapping)
> wp_page_copy()
> ptep_clear_flush()
> ptep_get_and_clear()
> & flush_tlb_page()
> write new value into addr
> - switch 'task' out (no tlb_flush)
> - switch 'task' in (no tlb_flush)
> - read addr again (Use stale
> mapping entry in TLB)
> get wrong value from old phyical
> addr, BUG!
>
> The solution is to keep all CPUs' footmarks of cpumask(mm) in switch_mm,
> which could guarantee to invalidate all stale TLB entries during TLB
> flush.
>
> Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> Signed-off-by: Guo Ren <[email protected]>
> Signed-off-by: Guo Ren <[email protected]>
> Cc: Anup Patel <[email protected]>
> Cc: Palmer Dabbelt <[email protected]>
> ---
> Changes in v3:
> - Move set/clear cpumask(mm) into set_mm (Make code more pretty
> with Andrew's advice)
> - Optimize comment description
>
> Changes in v2:
> - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
> test robot <[email protected]>)
> - Keep cpumask_clear_cpu for noasid
> ---
> arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
> 1 file changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 7acbfbd14557..0f784e3d307b 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -205,12 +205,24 @@ static void set_mm_noasid(struct mm_struct *mm)
> local_flush_tlb_all();
> }
>
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> + struct mm_struct *next, unsigned int cpu)
> {
> - if (static_branch_unlikely(&use_asid_allocator))
> - set_mm_asid(mm, cpu);
> - else
> - set_mm_noasid(mm);
> + /*
> + * The mm_cpumask indicates which harts' TLBs contain the virtual
> + * address mapping of the mm. Compared to noasid, using asid
> + * can't guarantee that stale TLB entries are invalidated because
> + * the asid mechanism wouldn't flush TLB for every switch_mm for
> + * performance. So when using asid, keep all CPUs footmarks in
> + * cpumask() until mm reset.
> + */
> + cpumask_set_cpu(cpu, mm_cpumask(next));
> + if (static_branch_unlikely(&use_asid_allocator)) {
> + set_mm_asid(next, cpu);
> + } else {
> + cpumask_clear_cpu(cpu, mm_cpumask(prev));
> + set_mm_noasid(next);
> + }
> }
>
> static int __init asids_init(void)
> @@ -264,7 +276,8 @@ static int __init asids_init(void)
> }
> early_initcall(asids_init);
> #else
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> + struct mm_struct *next, unsigned int cpu)
> {
> /* Nothing to do here when there is no MMU */
> }
> @@ -317,10 +330,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> */
> cpu = smp_processor_id();
>
> - cpumask_clear_cpu(cpu, mm_cpumask(prev));
> - cpumask_set_cpu(cpu, mm_cpumask(next));
> -
> - set_mm(next, cpu);
> + set_mm(prev, next, cpu);
>
> flush_icache_deferred(next, cpu);
> }

Tested-by: Sergey Matyukevich <[email protected]>

Thanks,
Sergey

2022-12-09 00:05:50

by Palmer Dabbelt

Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

On Fri, 18 Nov 2022 12:57:21 PST (-0800), [email protected] wrote:
> Hi Guo Ren,
>
>
>> After use_asid_allocator is enabled, the userspace application will
>> crash by stale TLB entries. Because only using cpumask_clear_cpu without
>> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
>> Then set_mm_asid would cause the user space application to get a stale
>> value by stale TLB entry, but set_mm_noasid is okay.
>
> ... [snip]
>
>> + /*
>> + * The mm_cpumask indicates which harts' TLBs contain the virtual
>> + * address mapping of the mm. Compared to noasid, using asid
>> + * can't guarantee that stale TLB entries are invalidated because
>> + * the asid mechanism wouldn't flush TLB for every switch_mm for
>> + * performance. So when using asid, keep all CPUs footmarks in
>> + * cpumask() until mm reset.
>> + */
>> + cpumask_set_cpu(cpu, mm_cpumask(next));
>> + if (static_branch_unlikely(&use_asid_allocator)) {
>> + set_mm_asid(next, cpu);
>> + } else {
>> + cpumask_clear_cpu(cpu, mm_cpumask(prev));
>> + set_mm_noasid(next);
>> + }
>> }
>
> I observe similar user-space crashes on my SMP systems with enabled ASID.
> My attempt to fix the issue was a bit different, see the following patch:
>
> https://lore.kernel.org/linux-riscv/[email protected]/
>
> In brief, the idea was borrowed from flush_icache_mm handling:
> - keep track of CPUs not running the task
> - perform per-ASID TLB flush on such CPUs only if the task is switched there

That way looks better to me: leaking hartids in the ASID allocator might
make the crashes go away, but it's just going to end up trending towards
flushing everything and that doesn't seem like the right long-term
solution.

So I've got that one on for-next, sorry I missed it before.

Thanks!

>
> Your patch also works fine in my tests fixing those crashes. I have a
> question though, regarding removed cpumask_clear_cpu. How CPUs no more
> running the task are removed from its mm_cpumask ? If they are not
> removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> TLB flushes to those CPUs when ASID is enabled.
>
> Regards,
> Sergey

2022-12-09 03:42:19

by Guo Ren

Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

On Fri, Dec 9, 2022 at 7:30 AM Palmer Dabbelt <[email protected]> wrote:
>
> On Fri, 18 Nov 2022 12:57:21 PST (-0800), [email protected] wrote:
> > Hi Guo Ren,
> >
> >
> >> After use_asid_allocator is enabled, the userspace application will
> >> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> >> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> >> Then set_mm_asid would cause the user space application to get a stale
> >> value by stale TLB entry, but set_mm_noasid is okay.
> >
> > ... [snip]
> >
> >> + /*
> >> + * The mm_cpumask indicates which harts' TLBs contain the virtual
> >> + * address mapping of the mm. Compared to noasid, using asid
> >> + * can't guarantee that stale TLB entries are invalidated because
> >> + * the asid mechanism wouldn't flush TLB for every switch_mm for
> >> + * performance. So when using asid, keep all CPUs footmarks in
> >> + * cpumask() until mm reset.
> >> + */
> >> + cpumask_set_cpu(cpu, mm_cpumask(next));
> >> + if (static_branch_unlikely(&use_asid_allocator)) {
> >> + set_mm_asid(next, cpu);
> >> + } else {
> >> + cpumask_clear_cpu(cpu, mm_cpumask(prev));
> >> + set_mm_noasid(next);
> >> + }
> >> }
> >
> > I observe similar user-space crashes on my SMP systems with enabled ASID.
> > My attempt to fix the issue was a bit different, see the following patch:
> >
> > https://lore.kernel.org/linux-riscv/[email protected]/
> >
> > In brief, the idea was borrowed from flush_icache_mm handling:
> > - keep track of CPUs not running the task
> > - perform per-ASID TLB flush on such CPUs only if the task is switched there
>
> That way looks better to me: leaking hartids in the ASID allocator might
> make the crashes go away, but it's just going to end up trending towards
> flushing everything and that doesn't seem like the right long-term
> solution.
The penalty in switch_mm is too heavy!!!
- If the system has multiple NUMA nodes, it will cause switch_mm_fast to
flush unnecessary harts.
- If the flush_range is just 1 entry, it would cause a flush_tlb_all_asid.

switch_mm_fast:
csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
((cntx & asid_mask) << SATP_ASID_SHIFT) |
satp_mode);

if (need_flush_tlb)
local_flush_tlb_all();
+#ifdef CONFIG_SMP
+ else {
+ cpumask_t *mask = &mm->context.tlb_stale_mask;
+
+ if (cpumask_test_cpu(cpu, mask)) {
+ cpumask_clear_cpu(cpu, mask);
+ local_flush_tlb_all_asid(cntx & asid_mask);
// penalty in switch_mm fast path
+ }
+ }
+#endif

And See:
static void __sbi_tlb_flush_range(struct mm_struct *mm, unsigned long start,
unsigned long size, unsigned long stride)
{
+ struct cpumask *pmask = &mm->context.tlb_stale_mask;
struct cpumask *cmask = mm_cpumask(mm);
unsigned int cpuid;
bool broadcast;
@@ -44,6 +29,15 @@ static void __sbi_tlb_flush_range(struct mm_struct *mm, unsigned long start,
if (static_branch_unlikely(&use_asid_allocator)) {
unsigned long asid = atomic_long_read(&mm->context.id);

+ /*
+ * TLB will be immediately flushed on harts concurrently
+ * executing this MM context. TLB flush on other harts
+ * is deferred until this MM context migrates there.
+ */
+ cpumask_setall(pmask);
^^^^^^^^^^^^^^^^^^^^^^^ It would flush all harts across
all NUMA nodes!!! Most of those flushes are deferred to switch_mm_fast,
so the penalty lands on unnecessary harts!
+ cpumask_clear_cpu(cpuid, pmask);
+ cpumask_andnot(pmask, pmask, cmask);

Please reconsider a bit, and make a smart decision. Just penalize the
harts that touched the mm, not all of them. And only flush the whole TLB
when some entries are needed.

The __sbi_tlb_flush_range is the slow path; keeping the fast path
performant is worth more than improving a slow one.

>
> So I've got that one on for-next, sorry I missed it before.
>
> Thanks!
>
> >
> > Your patch also works fine in my tests fixing those crashes. I have a
> > question though, regarding removed cpumask_clear_cpu. How CPUs no more
> > running the task are removed from its mm_cpumask ? If they are not
> > removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> > TLB flushes to those CPUs when ASID is enabled.
> >
> > Regards,
> > Sergey



--
Best Regards
Guo Ren

2022-12-09 04:33:42

by Guo Ren

Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

On Fri, Dec 9, 2022 at 11:13 AM Guo Ren <[email protected]> wrote:
>
> On Fri, Dec 9, 2022 at 7:30 AM Palmer Dabbelt <[email protected]> wrote:
> >
> > On Fri, 18 Nov 2022 12:57:21 PST (-0800), [email protected] wrote:
> > > Hi Guo Ren,
> > >
> > >
> > >> After use_asid_allocator is enabled, the userspace application will
> > >> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > >> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > >> Then set_mm_asid would cause the user space application to get a stale
> > >> value by stale TLB entry, but set_mm_noasid is okay.
> > >
> > > ... [snip]
> > >
> > >> + /*
> > >> + * The mm_cpumask indicates which harts' TLBs contain the virtual
> > >> + * address mapping of the mm. Compared to noasid, using asid
> > >> + * can't guarantee that stale TLB entries are invalidated because
> > >> + * the asid mechanism wouldn't flush TLB for every switch_mm for
> > >> + * performance. So when using asid, keep all CPUs footmarks in
> > >> + * cpumask() until mm reset.
> > >> + */
> > >> + cpumask_set_cpu(cpu, mm_cpumask(next));
> > >> + if (static_branch_unlikely(&use_asid_allocator)) {
> > >> + set_mm_asid(next, cpu);
> > >> + } else {
> > >> + cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > >> + set_mm_noasid(next);
> > >> + }
> > >> }
> > >
> > > I observe similar user-space crashes on my SMP systems with enabled ASID.
> > > My attempt to fix the issue was a bit different, see the following patch:
> > >
> > > https://lore.kernel.org/linux-riscv/[email protected]/
> > >
> > > In brief, the idea was borrowed from flush_icache_mm handling:
> > > - keep track of CPUs not running the task
> > > - perform per-ASID TLB flush on such CPUs only if the task is switched there
> >
> > That way looks better to me: leaking hartids in the ASID allocator might
> > make the crashes go away, but it's just going to end up trending towards
> > flushing everything and that doesn't seem like the right long-term
> > solution.
> The penalty in switch_mm is too heavy!!!
> - If the system has multiple NUMA nodes, it will cause switch_mm_fast
> flush unnecessary harts.
> - If flush_range is just 1 entry, it would case flush_tlb_all_asid.
>
> switch_mm_fast:
> csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
> ((cntx & asid_mask) << SATP_ASID_SHIFT) |
> satp_mode);
>
> if (need_flush_tlb)
> local_flush_tlb_all();
> +#ifdef CONFIG_SMP
> + else {
> + cpumask_t *mask = &mm->context.tlb_stale_mask;
> +
> + if (cpumask_test_cpu(cpu, mask)) {
> + cpumask_clear_cpu(cpu, mask);
> + local_flush_tlb_all_asid(cntx & asid_mask);
> // penalty in switch_mm fast path
> + }
> + }
> +#endif
>
> And see:
> static void __sbi_tlb_flush_range(struct mm_struct *mm, unsigned long start,
> unsigned long size, unsigned long stride)
> {
> + struct cpumask *pmask = &mm->context.tlb_stale_mask;
> struct cpumask *cmask = mm_cpumask(mm);
> unsigned int cpuid;
> bool broadcast;
> @@ -44,6 +29,15 @@ static void __sbi_tlb_flush_range(struct mm_struct
> *mm, unsigned long start,
> if (static_branch_unlikely(&use_asid_allocator)) {
> unsigned long asid = atomic_long_read(&mm->context.id);
>
> + /*
> + * TLB will be immediately flushed on harts concurrently
> + * executing this MM context. TLB flush on other harts
> + * is deferred until this MM context migrates there.
> + */
> + cpumask_setall(pmask);
> ^^^^^^^^^^^^^^^^^^^^^^^ It would flush all harts across
> all NUMA nodes!!! Most of those flushes are deferred to switch_mm_fast.
> The penalty hits unnecessary harts!
> + cpumask_clear_cpu(cpuid, pmask);
> + cpumask_andnot(pmask, pmask, cmask);
>
> Please reconsider a bit, and make a smart decision. Only penalize the
> harts that touched the mm, not all of them. And don't flush the whole
> TLB when only some entries need invalidating.
>
> The __sbi_tlb_flush_range is the slow path; keeping the fast path
> performance is worth more than improving a slow one.
>
> >
> > So I've got that one on for-next, sorry I missed it before.
> >
> > Thanks!
> >
> > >
> > > Your patch also works fine in my tests fixing those crashes. I have a
> > > question though, regarding the removed cpumask_clear_cpu. How are CPUs
> > > that no longer run the task removed from its mm_cpumask? If they are not
> > > removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> > > TLB flushes to those CPUs when ASID is enabled.
> > >
> > > Regards,
> > > Sergey
>
>
>
> --
> Best Regards
> Guo Ren



--
Best Regards
Guo Ren

2022-12-23 13:04:44

by Prabhakar

[permalink] [raw]
Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

Hi Guo,

Thank you for the patch.

On Fri, Nov 11, 2022 at 8:00 AM <[email protected]> wrote:
>
> From: Guo Ren <[email protected]>
>
> After use_asid_allocator is enabled, the userspace application will
> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> Then set_mm_asid would cause the user space application to get a stale
> value by stale TLB entry, but set_mm_noasid is okay.
>
> Here is the symptom of the bug:
> unhandled signal 11 code 0x1 (coredump)
> 0x0000003fd6d22524 <+4>: auipc s0,0x70
> 0x0000003fd6d22528 <+8>: ld s0,-148(s0) # 0x3fd6d92490
> => 0x0000003fd6d2252c <+12>: ld a5,0(s0)
> (gdb) i r s0
> s0 0x8082ed1cc3198b21 0x8082ed1cc3198b21
> (gdb) x /2x 0x3fd6d92490
> 0x3fd6d92490: 0xd80ac8a8 0x0000003f
> The core dump file shows that register s0 is wrong, but the value in
> memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> in TLB and got a wrong result from an incorrect physical address.
>
> When the task ran on CPU0, which loaded/speculative-loaded the value of
> address(0x3fd6d92490), then the first version of the mapping entry was
> PTWed into CPU0's TLB.
> When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> asid), it happened to write a value on the address (0x3fd6d92490). It
> caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> ptep_get_and_clear & flush_tlb_page.
> The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> switch_mm. So we only flushed the CPU1 TLB and set the second version
> mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> still used a stale TLB mapping entry which contained a wrong target
> physical address. It raised a bug when the task happened to read that
> value.
>
> CPU0 CPU1
> - switch 'task' in
> - read addr (Fill stale mapping
> entry into TLB)
> - switch 'task' out (no tlb_flush)
> - switch 'task' in (no tlb_flush)
> - write addr cause pagefault
> do_page_fault() (change to
> new addr mapping)
> wp_page_copy()
> ptep_clear_flush()
> ptep_get_and_clear()
> & flush_tlb_page()
> write new value into addr
> - switch 'task' out (no tlb_flush)
> - switch 'task' in (no tlb_flush)
> - read addr again (Use stale
> mapping entry in TLB)
> get wrong value from old physical
> addr, BUG!
>
> The solution is to keep all CPUs' footmarks of cpumask(mm) in switch_mm,
> which could guarantee to invalidate all stale TLB entries during TLB
> flush.
>
> Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> Signed-off-by: Guo Ren <[email protected]>
> Signed-off-by: Guo Ren <[email protected]>
> Cc: Anup Patel <[email protected]>
> Cc: Palmer Dabbelt <[email protected]>
> ---
> Changes in v3:
> - Move set/clear cpumask(mm) into set_mm (Make code more pretty
> with Andrew's advice)
> - Optimize comment description
>
> Changes in v2:
> - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
> test robot <[email protected]>)
> - Keep cpumask_clear_cpu for noasid
> ---
> arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
> 1 file changed, 20 insertions(+), 10 deletions(-)
>
As reported on the patch [0] I was seeing consistent failures on the
RZ/Five SoC while running bonnie++ utility. After applying this patch
on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
seeing this issue.

Tested-by: Lad Prabhakar <[email protected]>

[0] https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/

Cheers,
Prabhakar

> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index 7acbfbd14557..0f784e3d307b 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -205,12 +205,24 @@ static void set_mm_noasid(struct mm_struct *mm)
> local_flush_tlb_all();
> }
>
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> + struct mm_struct *next, unsigned int cpu)
> {
> - if (static_branch_unlikely(&use_asid_allocator))
> - set_mm_asid(mm, cpu);
> - else
> - set_mm_noasid(mm);
> + /*
> + * The mm_cpumask indicates which harts' TLBs contain the virtual
> + * address mapping of the mm. Compared to noasid, using asid
> + * can't guarantee that stale TLB entries are invalidated because
> + * the asid mechanism wouldn't flush TLB for every switch_mm for
> + * performance. So when using asid, keep all CPUs footmarks in
> + * cpumask() until mm reset.
> + */
> + cpumask_set_cpu(cpu, mm_cpumask(next));
> + if (static_branch_unlikely(&use_asid_allocator)) {
> + set_mm_asid(next, cpu);
> + } else {
> + cpumask_clear_cpu(cpu, mm_cpumask(prev));
> + set_mm_noasid(next);
> + }
> }
>
> static int __init asids_init(void)
> @@ -264,7 +276,8 @@ static int __init asids_init(void)
> }
> early_initcall(asids_init);
> #else
> -static inline void set_mm(struct mm_struct *mm, unsigned int cpu)
> +static inline void set_mm(struct mm_struct *prev,
> + struct mm_struct *next, unsigned int cpu)
> {
> /* Nothing to do here when there is no MMU */
> }
> @@ -317,10 +330,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> */
> cpu = smp_processor_id();
>
> - cpumask_clear_cpu(cpu, mm_cpumask(prev));
> - cpumask_set_cpu(cpu, mm_cpumask(next));
> -
> - set_mm(next, cpu);
> + set_mm(prev, next, cpu);
>
> flush_icache_deferred(next, cpu);
> }
> --
> 2.36.1
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

2023-02-23 17:58:15

by Zong Li

[permalink] [raw]
Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

Lad, Prabhakar <[email protected]> wrote on Fri, Dec 23, 2022 at 8:54 PM:
>
> Hi Guo,
>
> Thank you for the patch.
>
> On Fri, Nov 11, 2022 at 8:00 AM <[email protected]> wrote:
> >
> > ... [snip]
> >
> As reported on the patch [0] I was seeing consistent failures on the
> RZ/Five SoC while running bonnie++ utility. After applying this patch
> on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> seeing this issue.
>
> Tested-by: Lad Prabhakar <[email protected]>
>
> [0] https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/
>

Hi all,
I got the same situation (i.e. unhandled signal 11) on our internal
multi-core system. I tried patches [0] & [1], but it still doesn't
work, so I guess there are still some potential problems. After applying
this patch, the situation disappeared. I took some time to look at
other arches' implementations; for example, arc doesn't clear the
mm_cpumask due to a similar issue. I can't say which approach might
be better, but I'd like to point out that this patch works for me.
Thanks.

Tested-by: Zong Li <[email protected]>

[0] https://lore.kernel.org/linux-riscv/[email protected]/
[1] https://lore.kernel.org/linux-riscv/[email protected]/

> Cheers,
> Prabhakar
>
> > ... [snip]

2023-02-23 23:20:59

by Guo Ren

[permalink] [raw]
Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

On Fri, Feb 24, 2023 at 1:58 AM Zong Li <[email protected]> wrote:
>
> Lad, Prabhakar <[email protected]> wrote on Fri, Dec 23, 2022 at 8:54 PM:
> >
> > Hi Guo,
> >
> > Thank you for the patch.
> >
> > On Fri, Nov 11, 2022 at 8:00 AM <[email protected]> wrote:
> > >
> > > ... [snip]
> > >
> > As reported on the patch [0] I was seeing consistent failures on the
> > RZ/Five SoC while running bonnie++ utility. After applying this patch
> > on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> > seeing this issue.
> >
> > Tested-by: Lad Prabhakar <[email protected]>
> >
> > [0] https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/
> >
>
> Hi all,
> I got the same situation (i.e. unhandle signal 11) on our internal
> multi-core system, I tried the patch[0] & [1], but it still doesn't
> work, I guess there are still some potential problems. After applying
> this patch, the situation disappeared, I took some time to look at
> other arches' implementations, such as arc, they don't clear the
> mm_cpumask due to the similar issue. I can't say which approach might
> be better, but I'd like to point out that this patch works to me.
> Thanks.
Thx for the report.

I think the patch is simpler and more direct to the problem than the
current fixup approach. Although the maintainer thought it was a
temporary solution.

This patch has been tested in our production environment for many
thousands of hours. You can trust it :)

>
> Tested-by: Zong Li <[email protected]>


>
> [0] https://lore.kernel.org/linux-riscv/[email protected]/
> [1] https://lore.kernel.org/linux-riscv/[email protected]/
>
> > Cheers,
> > Prabhakar
> >
> > > ... [snip]



--
Best Regards
Guo Ren

2023-02-25 19:29:16

by Sergey Matyukevich

[permalink] [raw]
Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

On Fri, Feb 24, 2023 at 01:57:55AM +0800, Zong Li wrote:
> Lad, Prabhakar <[email protected]> wrote on Fri, Dec 23, 2022 at 8:54 PM:
> >
> > Hi Guo,
> >
> > Thank you for the patch.
> >
> > On Fri, Nov 11, 2022 at 8:00 AM <[email protected]> wrote:
> > >
> > > ... [snip]
> > >
> > As reported on the patch [0] I was seeing consistent failures on the
> > RZ/Five SoC while running bonnie++ utility. After applying this patch
> > on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> > seeing this issue.
> >
> > Tested-by: Lad Prabhakar <[email protected]>
> >
> > [0] https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/
> >
>
> Hi all,
> I got the same situation (i.e. unhandle signal 11) on our internal
> multi-core system, I tried the patch[0] & [1], but it still doesn't
> work, I guess there are still some potential problems. After applying
> this patch, the situation disappeared, I took some time to look at
> other arches' implementations, such as arc, they don't clear the
> mm_cpumask due to the similar issue. I can't say which approach might
> be better, but I'd like to point out that this patch works to me.
> Thanks.
>
> Tested-by: Zong Li <[email protected]>
>
> [0] https://lore.kernel.org/linux-riscv/[email protected]/
> [1] https://lore.kernel.org/linux-riscv/[email protected]/

Thanks for the report! By the way, could you please share some
information about the reproducing workload?

Initial idea was to reduce the number of TLB flushes by deferring (and
possibly avoiding) some of them. But we already have bug reports from
two different vendors, so apparently something was overlooked here.
Let's switch to the 'aggregating' mm_cpumask approach suggested by Guo Ren.

@Guo Ren, do you mind if I re-send your v3 patch together with the
remaining reverts of my changes ?

Regards,
Sergey

2023-02-26 04:27:57

by Guo Ren

[permalink] [raw]
Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

On Sat, Feb 25, 2023 at 2:29 PM Sergey Matyukevich <[email protected]> wrote:
>
> On Fri, Feb 24, 2023 at 01:57:55AM +0800, Zong Li wrote:
> > Lad, Prabhakar <[email protected]> wrote on Fri, Dec 23, 2022 at 8:54 PM:
> > >
> > > Hi Guo,
> > >
> > > Thank you for the patch.
> > >
> > > On Fri, Nov 11, 2022 at 8:00 AM <[email protected]> wrote:
> > > >
> > > > ... [snip]
> > > >
> > > As reported on the patch [0] I was seeing consistent failures on the
> > > RZ/Five SoC while running bonnie++ utility. After applying this patch
> > > on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> > > seeing this issue.
> > >
> > > Tested-by: Lad Prabhakar <[email protected]>
> > >
> > > [0] https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/
> > >
> >
> > Hi all,
> > I got the same situation (i.e. unhandled signal 11) on our internal
> > multi-core system. I tried patches [0] & [1], but it still doesn't
> > work, so I guess there are still some potential problems. After applying
> > this patch, the situation disappeared. I took some time to look at
> > other arches' implementations; arc, for example, doesn't clear the
> > mm_cpumask due to a similar issue. I can't say which approach might
> > be better, but I'd like to point out that this patch works for me.
> > Thanks.
> >
> > Tested-by: Zong Li <[email protected]>
> >
> > [0] https://lore.kernel.org/linux-riscv/[email protected]/
> > [1] https://lore.kernel.org/linux-riscv/[email protected]/
>
> Thanks for the report! By the way, could you please share some
> information about the reproducing workload ?
>
> Initial idea was to reduce the number of TLB flushes by deferring (and
> possibly avoiding) some of them. But we have already bug reports from
> two different vendors, so apparently something is overlooked here.
> Let's switch to the 'aggregating' mm_cpumask approach suggested by Guo Ren.
>
> @Guo Ren, do you mind if I re-send your v3 patch together with the
> remaining reverts of my changes ?
Okay, thx for taking care. Let's land the workaround first and then improve it.

Actually, the current riscv ASID allocator comes from the arm64 one, which
assumes hardware broadcast. Maybe we need to consider the x86 per-cpu ASID
pool approach.

>
> Regards,
> Sergey



--
Best Regards
Guo Ren

2023-02-27 22:43:12

by Gary Guo

Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

On Sat, 25 Feb 2023 23:24:40 -0500
Guo Ren <[email protected]> wrote:

> On Sat, Feb 25, 2023 at 2:29 PM Sergey Matyukevich <[email protected]> wrote:
> >
> > On Fri, Feb 24, 2023 at 01:57:55AM +0800, Zong Li wrote:
> > > Lad, Prabhakar <[email protected]> wrote on Fri, 23 Dec 2022 at 20:54:
> > > >
> > > > Hi Guo,
> > > >
> > > > Thank you for the patch.
> > > >
> > > > On Fri, Nov 11, 2022 at 8:00 AM <[email protected]> wrote:
> > > > >
> > > > > From: Guo Ren <[email protected]>
> > > > >
> > > > > After use_asid_allocator is enabled, the userspace application will
> > > > > crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > > > > local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > > > > Then set_mm_asid would cause the user space application to get a stale
> > > > > value by stale TLB entry, but set_mm_noasid is okay.
> > > > >
> > > > > Here is the symptom of the bug:
> > > > > unhandled signal 11 code 0x1 (coredump)
> > > > > 0x0000003fd6d22524 <+4>: auipc s0,0x70
> > > > > 0x0000003fd6d22528 <+8>: ld s0,-148(s0) # 0x3fd6d92490
> > > > > => 0x0000003fd6d2252c <+12>: ld a5,0(s0)
> > > > > (gdb) i r s0
> > > > > s0 0x8082ed1cc3198b21 0x8082ed1cc3198b21
> > > > > (gdb) x /2x 0x3fd6d92490
> > > > > 0x3fd6d92490: 0xd80ac8a8 0x0000003f
> > > > > The core dump file shows that register s0 is wrong, but the value in
> > > > > memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> > > > > in TLB and got a wrong result from an incorrect physical address.
> > > > >
> > > > > When the task ran on CPU0, which loaded/speculative-loaded the value of
> > > > > address(0x3fd6d92490), then the first version of the mapping entry was
> > > > > PTWed into CPU0's TLB.
> > > > > When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> > > > > asid), it happened to write a value on the address (0x3fd6d92490). It
> > > > > caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> > > > > ptep_get_and_clear & flush_tlb_page.
> > > > > The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> > > > > flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> > > > > switch_mm. So we only flushed the CPU1 TLB and set the second version
> > > > > mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> > > > > still used a stale TLB mapping entry which contained a wrong target
> > > > > physical address. It raised a bug when the task happened to read that
> > > > > value.
> > > > >
> > > > > CPU0 CPU1
> > > > > - switch 'task' in
> > > > > - read addr (Fill stale mapping
> > > > > entry into TLB)
> > > > > - switch 'task' out (no tlb_flush)
> > > > > - switch 'task' in (no tlb_flush)
> > > > > - write addr cause pagefault
> > > > > do_page_fault() (change to
> > > > > new addr mapping)
> > > > > wp_page_copy()
> > > > > ptep_clear_flush()
> > > > > ptep_get_and_clear()
> > > > > & flush_tlb_page()
> > > > > write new value into addr
> > > > > - switch 'task' out (no tlb_flush)
> > > > > - switch 'task' in (no tlb_flush)
> > > > > - read addr again (Use stale
> > > > > mapping entry in TLB)
> > > > > get wrong value from old physical
> > > > > addr, BUG!
> > > > >
> > > > > The solution is to keep all CPUs' footprints in mm_cpumask(mm) in
> > > > > switch_mm, which guarantees that all stale TLB entries are
> > > > > invalidated during a TLB flush.
> > > > >
> > > > > Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> > > > > Signed-off-by: Guo Ren <[email protected]>
> > > > > Signed-off-by: Guo Ren <[email protected]>
> > > > > Cc: Anup Patel <[email protected]>
> > > > > Cc: Palmer Dabbelt <[email protected]>
> > > > > ---
> > > > > Changes in v3:
> > > > > - Move set/clear cpumask(mm) into set_mm (Make code more pretty
> > > > > with Andrew's advice)
> > > > > - Optimize comment description
> > > > >
> > > > > Changes in v2:
> > > > > - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
> > > > > test robot <[email protected]>)
> > > > > - Keep cpumask_clear_cpu for noasid
> > > > > ---
> > > > > arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
> > > > > 1 file changed, 20 insertions(+), 10 deletions(-)
> > > > >
> > > > As reported on the patch [0] I was seeing consistent failures on the
> > > > RZ/Five SoC while running bonnie++ utility. After applying this patch
> > > > on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> > > > seeing this issue.
> > > >
> > > > Tested-by: Lad Prabhakar <[email protected]>
> > > >
> > > > [0] https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/
> > > >
> > >
> > > Hi all,
> > > I got the same situation (i.e. unhandled signal 11) on our internal
> > > multi-core system. I tried patches [0] & [1], but it still doesn't
> > > work, so I guess there are still some potential problems. After applying
> > > this patch, the situation disappeared. I took some time to look at
> > > other arches' implementations; arc, for example, doesn't clear the
> > > mm_cpumask due to a similar issue. I can't say which approach might
> > > be better, but I'd like to point out that this patch works for me.
> > > Thanks.
> > >
> > > Tested-by: Zong Li <[email protected]>
> > >
> > > [0] https://lore.kernel.org/linux-riscv/[email protected]/
> > > [1] https://lore.kernel.org/linux-riscv/[email protected]/
> >
> > Thanks for the report! By the way, could you please share some
> > information about the reproducing workload ?
> >
> > Initial idea was to reduce the number of TLB flushes by deferring (and
> > possibly avoiding) some of them. But we have already bug reports from
> > two different vendors, so apparently something is overlooked here.
> > Let's switch to the 'aggregating' mm_cpumask approach suggested by Guo Ren.
> >
> > @Guo Ren, do you mind if I re-send your v3 patch together with the
> > remaining reverts of my changes ?
> Okay, thx for taking care. Let's land the workaround first and then improve it.
>
> Actually, the current riscv ASID allocator comes from the arm64 one, which
> assumes hardware broadcast. Maybe we need to consider the x86 per-cpu ASID
> pool approach.

It should be noted that the spec expects supervisor software to
use a consistent meaning of non-zero ASIDs across different harts.

Also, a vendor could implement efficient hardware broadcasting ASID
invalidation with a custom instruction and expose it via SBI.

Best,
Gary

2023-02-28 03:12:59

by Guo Ren

Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

On Tue, Feb 28, 2023 at 6:40 AM Gary Guo <[email protected]> wrote:
>
> On Sat, 25 Feb 2023 23:24:40 -0500
> Guo Ren <[email protected]> wrote:
>
> > On Sat, Feb 25, 2023 at 2:29 PM Sergey Matyukevich <[email protected]> wrote:
> > >
> > > On Fri, Feb 24, 2023 at 01:57:55AM +0800, Zong Li wrote:
> > > > Lad, Prabhakar <[email protected]> wrote on Fri, 23 Dec 2022 at 20:54:
> > > > >
> > > > > Hi Guo,
> > > > >
> > > > > Thank you for the patch.
> > > > >
> > > > > On Fri, Nov 11, 2022 at 8:00 AM <[email protected]> wrote:
> > > > > >
> > > > > > From: Guo Ren <[email protected]>
> > > > > >
> > > > > > After use_asid_allocator is enabled, the userspace application will
> > > > > > crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > > > > > local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > > > > > Then set_mm_asid would cause the user space application to get a stale
> > > > > > value by stale TLB entry, but set_mm_noasid is okay.
> > > > > >
> > > > > > Here is the symptom of the bug:
> > > > > > unhandled signal 11 code 0x1 (coredump)
> > > > > > 0x0000003fd6d22524 <+4>: auipc s0,0x70
> > > > > > 0x0000003fd6d22528 <+8>: ld s0,-148(s0) # 0x3fd6d92490
> > > > > > => 0x0000003fd6d2252c <+12>: ld a5,0(s0)
> > > > > > (gdb) i r s0
> > > > > > s0 0x8082ed1cc3198b21 0x8082ed1cc3198b21
> > > > > > (gdb) x /2x 0x3fd6d92490
> > > > > > 0x3fd6d92490: 0xd80ac8a8 0x0000003f
> > > > > > The core dump file shows that register s0 is wrong, but the value in
> > > > > > memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry
> > > > > > in TLB and got a wrong result from an incorrect physical address.
> > > > > >
> > > > > > When the task ran on CPU0, which loaded/speculative-loaded the value of
> > > > > > address(0x3fd6d92490), then the first version of the mapping entry was
> > > > > > PTWed into CPU0's TLB.
> > > > > > When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by
> > > > > > asid), it happened to write a value on the address (0x3fd6d92490). It
> > > > > > caused do_page_fault -> wp_page_copy -> ptep_clear_flush ->
> > > > > > ptep_get_and_clear & flush_tlb_page.
> > > > > > The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB
> > > > > > flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous
> > > > > > switch_mm. So we only flushed the CPU1 TLB and set the second version
> > > > > > mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0
> > > > > > still used a stale TLB mapping entry which contained a wrong target
> > > > > > physical address. It raised a bug when the task happened to read that
> > > > > > value.
> > > > > >
> > > > > > CPU0 CPU1
> > > > > > - switch 'task' in
> > > > > > - read addr (Fill stale mapping
> > > > > > entry into TLB)
> > > > > > - switch 'task' out (no tlb_flush)
> > > > > > - switch 'task' in (no tlb_flush)
> > > > > > - write addr cause pagefault
> > > > > > do_page_fault() (change to
> > > > > > new addr mapping)
> > > > > > wp_page_copy()
> > > > > > ptep_clear_flush()
> > > > > > ptep_get_and_clear()
> > > > > > & flush_tlb_page()
> > > > > > write new value into addr
> > > > > > - switch 'task' out (no tlb_flush)
> > > > > > - switch 'task' in (no tlb_flush)
> > > > > > - read addr again (Use stale
> > > > > > mapping entry in TLB)
> > > > > > get wrong value from old physical
> > > > > > addr, BUG!
> > > > > >
> > > > > > The solution is to keep all CPUs' footprints in mm_cpumask(mm) in
> > > > > > switch_mm, which guarantees that all stale TLB entries are
> > > > > > invalidated during a TLB flush.
> > > > > >
> > > > > > Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> > > > > > Signed-off-by: Guo Ren <[email protected]>
> > > > > > Signed-off-by: Guo Ren <[email protected]>
> > > > > > Cc: Anup Patel <[email protected]>
> > > > > > Cc: Palmer Dabbelt <[email protected]>
> > > > > > ---
> > > > > > Changes in v3:
> > > > > > - Move set/clear cpumask(mm) into set_mm (Make code more pretty
> > > > > > with Andrew's advice)
> > > > > > - Optimize comment description
> > > > > >
> > > > > > Changes in v2:
> > > > > > - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
> > > > > > test robot <[email protected]>)
> > > > > > - Keep cpumask_clear_cpu for noasid
> > > > > > ---
> > > > > > arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
> > > > > > 1 file changed, 20 insertions(+), 10 deletions(-)
> > > > > >
> > > > > As reported on the patch [0] I was seeing consistent failures on the
> > > > > RZ/Five SoC while running bonnie++ utility. After applying this patch
> > > > > on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> > > > > seeing this issue.
> > > > >
> > > > > Tested-by: Lad Prabhakar <[email protected]>
> > > > >
> > > > > [0] https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/
> > > > >
> > > >
> > > > Hi all,
> > > > I got the same situation (i.e. unhandled signal 11) on our internal
> > > > multi-core system. I tried patches [0] & [1], but it still doesn't
> > > > work, so I guess there are still some potential problems. After applying
> > > > this patch, the situation disappeared. I took some time to look at
> > > > other arches' implementations; arc, for example, doesn't clear the
> > > > mm_cpumask due to a similar issue. I can't say which approach might
> > > > be better, but I'd like to point out that this patch works for me.
> > > > Thanks.
> > > >
> > > > Tested-by: Zong Li <[email protected]>
> > > >
> > > > [0] https://lore.kernel.org/linux-riscv/[email protected]/
> > > > [1] https://lore.kernel.org/linux-riscv/[email protected]/
> > >
> > > Thanks for the report! By the way, could you please share some
> > > information about the reproducing workload ?
> > >
> > > Initial idea was to reduce the number of TLB flushes by deferring (and
> > > possibly avoiding) some of them. But we have already bug reports from
> > > two different vendors, so apparently something is overlooked here.
> > > Let's switch to the 'aggregating' mm_cpumask approach suggested by Guo Ren.
> > >
> > > @Guo Ren, do you mind if I re-send your v3 patch together with the
> > > remaining reverts of my changes ?
> > Okay, thx for taking care. Let's land the workaround first and then improve it.
> >
> > Actually, the current riscv ASID allocator comes from the arm64 one, which
> > assumes hardware broadcast. Maybe we need to consider the x86 per-cpu ASID
> > pool approach.
>
> It should be noted that the spec expects supervisor software to
> use a consistent meaning of non-zero ASIDs across different harts.
>
> Also, a vendor could implement efficient hardware broadcasting ASID
> invalidation with a custom instruction and expose it via SBI.
I agree with you; actually, our XuanTie cores support hardware broadcast
invalidation, and we expect the SBI style.

The x86 style would be another choice for the future, and riscv would
combine different TLB maintenance styles into one Linux architecture.

----

Let me give an outline of the current riscv state:

1. Current Linux riscv uses a unified ASID design, which comes from the
arm64 hardware-broadcast one. Some riscv vendors (XuanTie) also support
TLB hw broadcast.

2. The riscv spec doesn't recommend hw broadcast because it's unsuitable
for large-scale systems. So the x86 style would be another choice for
future Linux riscv.

Correct? Welcome feedback.

>
> Best,
> Gary



--
Best Regards
Guo Ren