Date:   Sat, 25 Feb 2023 22:29:04 +0300
From:   Sergey Matyukevich <geomatsi@gmail.com>
To:     Zong Li <zongbox@gmail.com>, Guo Ren <guoren@linux.alibaba.com>,
        guoren@kernel.org
Cc:     "Lad, Prabhakar" <prabhakar.csengg@gmail.com>, guoren@kernel.org,
        anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com,
        conor.dooley@microchip.com, heiko@sntech.de,
        philipp.tomsich@vrull.eu, alex@ghiti.fr, hch@lst.de,
        ajones@ventanamicro.com, gary@garyguo.net, jszhang@kernel.org,
        linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org,
        Guo Ren <guoren@linux.alibaba.com>,
        Anup Patel <apatel@ventanamicro.com>,
        Palmer Dabbelt <palmer@rivosinc.com>,
        Zong Li <zong.li@sifive.com>
Subject: Re: [PATCH V3] riscv: asid: Fixup stale TLB entry cause application
 crash
Message-ID: <Y/phgGFWMf/4WRSS@curiosity>
References: <20221111075902.798571-1-guoren@kernel.org>
 <CA+V-a8tbFhefuYF0UrGNrKZn6CpHEUhOvsf4GNmdLza0gWvf=w@mail.gmail.com>
 <CA+ZOyah9dBzzkHyy6wxk+hok3K1YrR9h+VdA3aTW5+m9ne04SQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CA+ZOyah9dBzzkHyy6wxk+hok3K1YrR9h+VdA3aTW5+m9ne04SQ@mail.gmail.com>
Precedence: bulk

On Fri, Feb 24, 2023 at 01:57:55AM +0800, Zong Li wrote:
> Lad, Prabhakar <prabhakar.csengg@gmail.com> wrote on Fri, Dec 23, 2022 at 8:54 PM:
> >
> > Hi Guo,
> >
> > Thank you for the patch.
> >
> > On Fri, Nov 11, 2022 at 8:00 AM <guoren@kernel.org> wrote:
> > >
> > > From: Guo Ren <guoren@linux.alibaba.com>
> > >
> > > > After use_asid_allocator is enabled, userspace applications can
> > > > crash because of stale TLB entries: using cpumask_clear_cpu without
> > > > local_flush_tlb_all cannot guarantee that a CPU's TLB entries are
> > > > fresh. set_mm_asid can then cause a userspace application to read a
> > > > stale value through a stale TLB entry; set_mm_noasid is not affected.
> > >
> > > Here is the symptom of the bug:
> > > unhandled signal 11 code 0x1 (coredump)
> > >    0x0000003fd6d22524 <+4>:     auipc   s0,0x70
> > >    0x0000003fd6d22528 <+8>:     ld      s0,-148(s0) # 0x3fd6d92490
> > > => 0x0000003fd6d2252c <+12>:    ld      a5,0(s0)
> > > (gdb) i r s0
> > > s0          0x8082ed1cc3198b21       0x8082ed1cc3198b21
> > > (gdb) x /2x 0x3fd6d92490
> > > 0x3fd6d92490:   0xd80ac8a8      0x0000003f
> > > > The core dump file shows that register s0 is wrong, but the value in
> > > > memory is correct. This is because 'ld s0, -148(s0)' used a stale
> > > > mapping entry in the TLB and loaded from an incorrect physical address.
> > >
> > > > When the task ran on CPU0 and loaded (or speculatively loaded) the
> > > > value at address 0x3fd6d92490, the first version of the mapping entry
> > > > was PTWed into CPU0's TLB.
> > > > When the task switched from CPU0 to CPU1 (no local_flush_tlb_all here
> > > > because of ASID), it happened to write a value to address
> > > > 0x3fd6d92490. That caused do_page_fault -> wp_page_copy ->
> > > > ptep_clear_flush -> ptep_get_and_clear & flush_tlb_page.
> > > > flush_tlb_page used mm_cpumask(mm) to determine which CPUs needed a
> > > > TLB flush, but CPU0 had already cleared its bit in mm_cpumask in the
> > > > previous switch_mm. So only CPU1's TLB was flushed before the second
> > > > version of the PTE mapping was set. When the task switched from CPU1
> > > > back to CPU0, CPU0 still used a stale TLB entry that contained the
> > > > wrong target physical address. The bug showed up when the task
> > > > happened to read that value.
> > >
> > >    CPU0                               CPU1
> > >    - switch 'task' in
> > >    - read addr (Fill stale mapping
> > >      entry into TLB)
> > >    - switch 'task' out (no tlb_flush)
> > >                                       - switch 'task' in (no tlb_flush)
> > >                                       - write addr cause pagefault
> > >                                         do_page_fault() (change to
> > >                                         new addr mapping)
> > >                                           wp_page_copy()
> > >                                             ptep_clear_flush()
> > >                                               ptep_get_and_clear()
> > >                                               & flush_tlb_page()
> > >                                         write new value into addr
> > >                                       - switch 'task' out (no tlb_flush)
> > >    - switch 'task' in (no tlb_flush)
> > >    - read addr again (Use stale
> > >      mapping entry in TLB)
> > >      get wrong value from old phyical
> > >      addr, BUG!
> > >
> > > > The solution is to keep the bits of all CPUs that have run the mm set
> > > > in mm_cpumask(mm) in switch_mm, which guarantees that all stale TLB
> > > > entries are invalidated during a TLB flush.
> > >
> > > Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator")
> > > Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> > > Signed-off-by: Guo Ren <guoren@kernel.org>
> > > Cc: Anup Patel <apatel@ventanamicro.com>
> > > Cc: Palmer Dabbelt <palmer@rivosinc.com>
> > > ---
> > > Changes in v3:
> > >  - Move set/clear cpumask(mm) into set_mm (Make code more pretty
> > >    with Andrew's advice)
> > >  - Optimize comment description
> > >
> > > Changes in v2:
> > >  - Fixup nommu compile problem (Thx Conor, Also Reported-by: kernel
> > >    test robot <lkp@intel.com>)
> > >  - Keep cpumask_clear_cpu for noasid
> > > ---
> > >  arch/riscv/mm/context.c | 30 ++++++++++++++++++++----------
> > >  1 file changed, 20 insertions(+), 10 deletions(-)
> > >
> > As reported on the patch [0] I was seeing consistent failures on the
> > RZ/Five SoC while running bonnie++ utility. After applying this patch
> > on top of Palmer's for-next branch (eb67d239f3aa) I am no longer
> > seeing this issue.
> >
> > Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> >
> > [0] https://patchwork.kernel.org/project/linux-riscv/patch/20220829205219.283543-1-geomatsi@gmail.com/
> >
> 
> Hi all,
> I hit the same situation (i.e. unhandled signal 11) on our internal
> multi-core system. I tried patches [0] and [1], but they still don't
> work, so I guess there are still some potential problems. After applying
> this patch, the problem disappeared. I took some time to look at
> other arches' implementations, such as arc; they don't clear the
> mm_cpumask because of a similar issue. I can't say which approach might
> be better, but I'd like to point out that this patch works for me.
> Thanks.
> 
> Tested-by: Zong Li <zong.li@sifive.com>
> 
> [0] https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> [1] https://lore.kernel.org/linux-riscv/20230129211818.686557-1-geomatsi@gmail.com/

Thanks for the report! By the way, could you please share some
information about the workload that reproduces the problem?

The initial idea was to reduce the number of TLB flushes by deferring (and
possibly avoiding) some of them. But we already have bug reports from
two different vendors, so apparently something was overlooked here.
Let's switch to the 'aggregating' mm_cpumask approach suggested by Guo Ren.
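
To illustrate the difference between clearing and aggregating mm_cpumask,
here is a minimal userspace sketch of the CPU0/CPU1 sequence from the commit
message. It is a toy model, not kernel code: all toy_* names are hypothetical,
each CPU has a single TLB slot, and the page table is reduced to a version
counter. The only assumption taken from the thread is that the flush on a
write fault reaches just the CPUs whose bit is set in the mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 2

/* Toy stand-in for an mm: one mapping, tracked by a version counter. */
struct toy_mm {
	uint32_t cpumask;	/* stand-in for mm_cpumask(mm) */
	int pte_version;	/* bumped whenever the mapping changes */
};

/* Toy per-CPU TLB with a single slot. */
struct toy_cpu {
	bool tlb_valid;
	int tlb_version;	/* mapping version cached in the TLB */
};

static struct toy_cpu cpus[NR_CPUS];

/* Switching in only sets the bit; with ASIDs there is no TLB flush. */
static void toy_switch_in(struct toy_mm *mm, int cpu)
{
	mm->cpumask |= 1u << cpu;
}

/* aggregate == false models clearing the bit on switch-out. */
static void toy_switch_out(struct toy_mm *mm, int cpu, bool aggregate)
{
	if (!aggregate)
		mm->cpumask &= ~(1u << cpu);
}

/* Read through the TLB; refill from the "page table" on a miss. */
static int toy_read(struct toy_mm *mm, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (!cpus[cpu].tlb_valid) {
		cpus[cpu].tlb_valid = true;
		cpus[cpu].tlb_version = mm->pte_version;
	}
	return cpus[cpu].tlb_version;
}

/* Write fault: change the mapping, flush only CPUs present in the mask. */
static void toy_write_fault(struct toy_mm *mm)
{
	mm->pte_version++;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mm->cpumask & (1u << cpu))
			cpus[cpu].tlb_valid = false;
}

/* Replays the sequence; returns true if CPU0 ends up with a stale entry. */
static bool toy_stale_read(bool aggregate)
{
	struct toy_mm mm = { .cpumask = 0, .pte_version = 1 };

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		cpus[cpu] = (struct toy_cpu){ .tlb_valid = false };

	toy_switch_in(&mm, 0);			/* CPU0: task in */
	toy_read(&mm, 0);			/* CPU0: fill TLB */
	toy_switch_out(&mm, 0, aggregate);	/* CPU0: task out */

	toy_switch_in(&mm, 1);			/* CPU1: task in */
	toy_write_fault(&mm);			/* CPU1: mapping changes */
	toy_switch_out(&mm, 1, aggregate);	/* CPU1: task out */

	toy_switch_in(&mm, 0);			/* CPU0: task in again */
	return toy_read(&mm, 0) != mm.pte_version;
}
```

In this model toy_stale_read(false) reports a stale read on CPU0, while
toy_stale_read(true) does not, because CPU0's bit is still set when the
write fault on CPU1 triggers the flush.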

@Guo Ren, do you mind if I re-send your v3 patch together with the
remaining reverts of my changes?

Regards,
Sergey