Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
MIME-Version: 1.0
References: <Y3fxsWPLPlKGfMBj@curiosity> <mhng-1d55338a-53a1-42eb-bf5c-91655fde2734@palmer-ri-x1c9a>
In-Reply-To: <mhng-1d55338a-53a1-42eb-bf5c-91655fde2734@palmer-ri-x1c9a>
From:   Guo Ren <guoren@kernel.org>
Date:   Fri, 9 Dec 2022 11:13:28 +0800
Message-ID: <CAJF2gTQqF8GG77SKm9MpJC28DQebHsjuSoD_+FRRV0GcKpQFCg@mail.gmail.com>
Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash
To:     Palmer Dabbelt <palmer@rivosinc.com>, geomatsi@gmail.com,
        philipp.tomsich@vrull.eu, Paul Walmsley <paul.walmsley@sifive.com>,
        anup@brainfault.org
Cc:     Conor Dooley <conor.dooley@microchip.com>, heiko@sntech.de,
        alex@ghiti.fr, Christoph Hellwig <hch@lst.de>,
        ajones@ventanamicro.com, gary@garyguo.net, jszhang@kernel.org,
        linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org,
        guoren@linux.alibaba.com, apatel@ventanamicro.com
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Fri, Dec 9, 2022 at 7:30 AM Palmer Dabbelt <palmer@rivosinc.com> wrote:
>
> On Fri, 18 Nov 2022 12:57:21 PST (-0800), geomatsi@gmail.com wrote:
> > Hi Guo Ren,
> >
> >
> >> After use_asid_allocator is enabled, the userspace application will
> >> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> >> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> >> Then set_mm_asid would cause the user space application to get a stale
> >> value by stale TLB entry, but set_mm_noasid is okay.
> >
> > ... [snip]
> >
> >> +    /*
> >> +     * The mm_cpumask indicates which harts' TLBs contain the virtual
> >> +     * address mapping of the mm. Compared to noasid, using asid
> >> +     * can't guarantee that stale TLB entries are invalidated because
> >> +     * the asid mechanism wouldn't flush TLB for every switch_mm for
> >> +     * performance. So when using asid, keep all CPUs footmarks in
> >> +     * cpumask() until mm reset.
> >> +     */
> >> +    cpumask_set_cpu(cpu, mm_cpumask(next));
> >> +    if (static_branch_unlikely(&use_asid_allocator)) {
> >> +            set_mm_asid(next, cpu);
> >> +    } else {
> >> +            cpumask_clear_cpu(cpu, mm_cpumask(prev));
> >> +            set_mm_noasid(next);
> >> +    }
> >>  }
> >
> > I observe similar user-space crashes on my SMP systems with enabled ASID.
> > My attempt to fix the issue was a bit different, see the following patch:
> >
> > https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> >
> > In brief, the idea was borrowed from flush_icache_mm handling:
> > - keep track of CPUs not running the task
> > - perform per-ASID TLB flush on such CPUs only if the task is switched there
>
> That way looks better to me: leaking hartids in the ASID allocator might
> make the crashes go away, but it's just going to end up trending towards
> flushing everything and that doesn't seem like the right long-term
> solution.
The penalty in switch_mm is too heavy!!!
 - If the system has multiple NUMA nodes, it will cause switch_mm_fast
flush unnecessary harts.
 - If flush_range is just 1 entry, it would case flush_tlb_all_asid.

switch_mm_fast:
        csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
                  ((cntx & asid_mask) << SATP_ASID_SHIFT) |
                  satp_mode);

        if (need_flush_tlb)
                local_flush_tlb_all();
+#ifdef CONFIG_SMP
+        else {
+                cpumask_t *mask = &mm->context.tlb_stale_mask;+
+
+               if (cpumask_test_cpu(cpu, mask)) {
+                        cpumask_clear_cpu(cpu, mask);
+                        local_flush_tlb_all_asid(cntx & asid_mask);
// penalty in switch_mm fast path
+                }
+        }
+#endif

And See:
 static void __sbi_tlb_flush_range(struct mm_struct *mm, unsigned long start,
                                  unsigned long size, unsigned long stride)
 {
+       struct cpumask *pmask = &mm->context.tlb_stale_mask;
        struct cpumask *cmask = mm_cpumask(mm);
        unsigned int cpuid;
        bool broadcast;
@@ -44,6 +29,15 @@ static void __sbi_tlb_flush_range(struct mm_struct
*mm, unsigned long start,
        if (static_branch_unlikely(&use_asid_allocator)) {
                unsigned long asid = atomic_long_read(&mm->context.id);

+               /*
+                * TLB will be immediately flushed on harts concurrently
+                * executing this MM context. TLB flush on other harts
+                * is deferred until this MM context migrates there.
+                */
+               cpumask_setall(pmask);
                 ^^^^^^^^^^^^^^^^^^^^^^^ It would flush all harts for
all NUMA nodes!!! Most of them deferred to switch_mm_fast. The penalty
contains unnecessary harts!
+               cpumask_clear_cpu(cpuid, pmask);
+               cpumask_andnot(pmask, pmask, cmask);

Please reconsider a bit, and make a smart decision. Just penalty the
harts who touched the mm, not all. And only flush the whole TLB when
some entries are needed.

The __sbi_tlb_flush_range is the slow path; keeping the fast path
performance is worth more than improving a slow one.

>
> So I've got that one on for-next, sorry I missed it before.
>
> Thanks!
>
> >
> > Your patch also works fine in my tests fixing those crashes. I have a
> > question though, regarding removed cpumask_clear_cpu. How CPUs no more
> > running the task are removed from its mm_cpumask ? If they are not
> > removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> > TLB flushes to those CPUs when ASID is enabled.
> >
> > Regards,
> > Sergey


-- 
Best Regards
 Guo Ren