2021-03-16 09:43:00

by Anup Patel

Subject: Re: [PATCH] Insert SFENCE.VMA in function set_pte_at for RISCV

On Tue, Mar 16, 2021 at 1:59 PM Andrew Waterman
<[email protected]> wrote:
>
> On Tue, Mar 16, 2021 at 12:32 AM Anup Patel <[email protected]> wrote:
> >
> > On Tue, Mar 16, 2021 at 12:27 PM Jiuyang Liu <[email protected]> wrote:
> > >
> > > > As per my understanding, we don't need to explicitly invalidate local TLB
> > > > in set_pte() or set_pte_at() because generic Linux page table management
> > > > (<linux>/mm/*) will call the appropriate flush_tlb_xyz() function after page
> > > > table updates.
> > >
> > > I witnessed this bug in our micro-architecture: set_pte instruction is
> > > still in the store buffer, no functions are inserting SFENCE.VMA in
> > > the stack below, so TLB cannot witness this modification.
> > > Here is my call stack:
> > > set_pte
> > > set_pte_at
> > > map_vm_area
> > > __vmalloc_area_node
> > > __vmalloc_node_range
> > > __vmalloc_node
> > > __vmalloc_node_flags
> > > vzalloc
> > > n_tty_open
> > >
> > > I think this is an architecture specific code, so <linux>/mm/* should
> > > not be modified.
> > > And spec requires SFENCE.VMA to be inserted on each modification to
> > > TLB. So I added code here.
> >
> > The generic linux/mm/* already calls the appropriate tlb_flush_xyz()
> > function defined in arch/riscv/include/asm/tlbflush.h
> >
> > Better to have a write-barrier in set_pte().
> >
> > >
> > > > Also, just local TLB flush is generally not sufficient because
> > > > a lot of page tables will be used across multiple HARTs.
> > >
> > > Yes, this is the biggest issue, in RISC-V Volume 2, Privileged Spec v.
> > > 20190608 page 67 gave a solution:
> >
> > This is not an issue with RISC-V privilege spec rather it is more about
> > placing RISC-V fences at right locations.
> >
> > > Consequently, other harts must be notified separately when the
> > > memory-management data structures have been modified. One approach is
> > > to use
> > > 1) a local data fence to ensure local writes are visible globally,
> > > then 2) an interprocessor interrupt to the other thread,
> > > then 3) a local SFENCE.VMA in the interrupt handler of the remote thread,
> > > and finally 4) signal back to originating thread that operation is
> > > complete. This is, of course, the RISC-V analog to a TLB shootdown.
> >
> > I would suggest trying approach#1.
> >
> > You can include "asm/barrier.h" here and use wmb() or __smp_wmb()
> > in-place of local TLB flush.
>
> wmb() doesn't suffice to order older stores before younger page-table
> walks, so that might hide the problem without actually fixing it.

If we treat page-table walks as reads, then mb() might be more
suitable in this case?

ARM64 also has an explicit barrier in its set_pte() implementation: it
does "dsb(ishst); isb()", which is an inner-shareable store barrier followed
by an instruction barrier.
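
As a rough sketch of what a barrier in set_pte() could look like (user-space
C, only to show the shape of the idea): the names pte_sketch_t,
full_barrier_sketch() and set_pte_sketch() are hypothetical, and the RISC-V
branch assumes "fence iorw, iorw" as the full fence. Whether such a fence
actually orders the store before later page-table walks is exactly the open
question raised above, so this is not a confirmed fix.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's pte_t. */
typedef struct {
	uint64_t val;
} pte_sketch_t;

/* Full barrier: a RISC-V fence when built for RISC-V, else a C11 fence. */
static inline void full_barrier_sketch(void)
{
#if defined(__riscv)
	__asm__ __volatile__ ("fence iorw, iorw" : : : "memory");
#else
	atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Store the new PTE value, then issue the barrier (assumed ordering). */
static inline void set_pte_sketch(pte_sketch_t *ptep, pte_sketch_t pteval)
{
	*ptep = pteval;
	full_barrier_sketch();
}
```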

>
> Based upon Jiuyang's description, it does sound plausible that we are
> missing an SFENCE.VMA (or TLB shootdown) somewhere. But I don't
> understand the situation well enough to know where that might be, or
> what the best fix is.

Yes, I agree, but judging by the set_pte() implementations of other
architectures, set_pte() doesn't seem to be the right place for a TLB
shootdown.

Regards,
Anup

>
>
> >
> > >
> > > In general, this patch didn't handle the G bit in PTE, kernel trap it
> > > to sbi_remote_sfence_vma. do you think I should use flush_tlb_all?
> > >
> > > Jiuyang
> > >
> > >
> > >
> > >
> > > arch/arm/mm/mmu.c
> > > void set_pte_at(struct mm_struct *mm, unsigned long addr,
> > > pte_t *ptep, pte_t pteval)
> > > {
> > > unsigned long ext = 0;
> > >
> > > if (addr < TASK_SIZE && pte_valid_user(pteval)) {
> > > if (!pte_special(pteval))
> > > __sync_icache_dcache(pteval);
> > > ext |= PTE_EXT_NG;
> > > }
> > >
> > > set_pte_ext(ptep, pteval, ext);
> > > }
> > >
> > > arch/mips/include/asm/pgtable.h
> > > static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
> > > pte_t *ptep, pte_t pteval)
> > > {
> > >
> > > if (!pte_present(pteval))
> > > goto cache_sync_done;
> > >
> > > if (pte_present(*ptep) && (pte_pfn(*ptep) == pte_pfn(pteval)))
> > > goto cache_sync_done;
> > >
> > > __update_cache(addr, pteval);
> > > cache_sync_done:
> > > set_pte(ptep, pteval);
> > > }
> > >
> > >
> > > Also, just local TLB flush is generally not sufficient because
> > > > > a lot of page tables will be used across multiple HARTs.
> > >
> > >
> > > On Tue, Mar 16, 2021 at 5:05 AM Anup Patel <[email protected]> wrote:
> > > >
> > > > +Alex
> > > >
> > > > On Tue, Mar 16, 2021 at 9:20 AM Jiuyang Liu <[email protected]> wrote:
> > > > >
> > > > > This patch inserts SFENCE.VMA after modifying PTE based on RISC-V
> > > > > specification.
> > > > >
> > > > > arch/riscv/include/asm/pgtable.h:
> > > > > > 1. implement pte_user, pte_global and pte_leaf to check the corresponding
> > > > > > attribute of a pte_t.
> > > >
> > > > Adding pte_user(), pte_global(), and pte_leaf() is fine.
> > > >
> > > > >
> > > > > 2. insert SFENCE.VMA in set_pte_at based on RISC-V Volume 2, Privileged
> > > > > Spec v. 20190608 page 66 and 67:
> > > > > If software modifies a non-leaf PTE, it should execute SFENCE.VMA with
> > > > > rs1=x0. If any PTE along the traversal path had its G bit set, rs2 must
> > > > > be x0; otherwise, rs2 should be set to the ASID for which the
> > > > > translation is being modified.
> > > > > If software modifies a leaf PTE, it should execute SFENCE.VMA with rs1
> > > > > set to a virtual address within the page. If any PTE along the traversal
> > > > > path had its G bit set, rs2 must be x0; otherwise, rs2 should be set to
> > > > > the ASID for which the translation is being modified.
> > > > >
> > > > > arch/riscv/include/asm/tlbflush.h:
> > > > > 1. implement get_current_asid to get current program asid.
> > > > > 2. implement local_flush_tlb_asid to flush tlb with asid.
> > > >
> > > > As per my understanding, we don't need to explicitly invalidate local TLB
> > > > > in set_pte() or set_pte_at() because generic Linux page table management
> > > > > (<linux>/mm/*) will call the appropriate flush_tlb_xyz() function after page
> > > > > table updates. Also, just local TLB flush is generally not sufficient because
> > > > > a lot of page tables will be used across multiple HARTs.
> > > >
> > > > >
> > > > > Signed-off-by: Jiuyang Liu <[email protected]>
> > > > > ---
> > > > > arch/riscv/include/asm/pgtable.h | 27 +++++++++++++++++++++++++++
> > > > > arch/riscv/include/asm/tlbflush.h | 12 ++++++++++++
> > > > > 2 files changed, 39 insertions(+)
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> > > > > index ebf817c1bdf4..5a47c60372c1 100644
> > > > > --- a/arch/riscv/include/asm/pgtable.h
> > > > > +++ b/arch/riscv/include/asm/pgtable.h
> > > > > @@ -222,6 +222,16 @@ static inline int pte_write(pte_t pte)
> > > > > return pte_val(pte) & _PAGE_WRITE;
> > > > > }
> > > > >
> > > > > +static inline int pte_user(pte_t pte)
> > > > > +{
> > > > > + return pte_val(pte) & _PAGE_USER;
> > > > > +}
> > > > > +
> > > > > +static inline int pte_global(pte_t pte)
> > > > > +{
> > > > > + return pte_val(pte) & _PAGE_GLOBAL;
> > > > > +}
> > > > > +
> > > > > static inline int pte_exec(pte_t pte)
> > > > > {
> > > > > return pte_val(pte) & _PAGE_EXEC;
> > > > > @@ -248,6 +258,11 @@ static inline int pte_special(pte_t pte)
> > > > > return pte_val(pte) & _PAGE_SPECIAL;
> > > > > }
> > > > >
> > > > > +static inline int pte_leaf(pte_t pte)
> > > > > +{
> > > > > + return pte_val(pte) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC);
> > > > > +}
> > > > > +
> > > > > /* static inline pte_t pte_rdprotect(pte_t pte) */
> > > > >
> > > > > static inline pte_t pte_wrprotect(pte_t pte)
> > > > > @@ -358,6 +373,18 @@ static inline void set_pte_at(struct mm_struct *mm,
> > > > > flush_icache_pte(pteval);
> > > > >
> > > > > set_pte(ptep, pteval);
> > > > > +
> > > > > + if (pte_present(pteval)) {
> > > > > + if (pte_leaf(pteval)) {
> > > > > + local_flush_tlb_page(addr);
> > > > > + } else {
> > > > > + if (pte_global(pteval))
> > > > > + local_flush_tlb_all();
> > > > > + else
> > > > > + local_flush_tlb_asid();
> > > > > +
> > > > > + }
> > > > > + }
> > > > > }
> > > > >
> > > > > static inline void pte_clear(struct mm_struct *mm,
> > > > > diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
> > > > > index 394cfbccdcd9..1f9b62b3670b 100644
> > > > > --- a/arch/riscv/include/asm/tlbflush.h
> > > > > +++ b/arch/riscv/include/asm/tlbflush.h
> > > > > @@ -21,6 +21,18 @@ static inline void local_flush_tlb_page(unsigned long addr)
> > > > > {
> > > > > __asm__ __volatile__ ("sfence.vma %0" : : "r" (addr) : "memory");
> > > > > }
> > > > > +
> > > > > +static inline unsigned long get_current_asid(void)
> > > > > +{
> > > > > + return (csr_read(CSR_SATP) >> SATP_ASID_SHIFT) & SATP_ASID_MASK;
> > > > > +}
> > > > > +
> > > > > +static inline void local_flush_tlb_asid(void)
> > > > > +{
> > > > > + unsigned long asid = get_current_asid();
> > > > > + __asm__ __volatile__ ("sfence.vma x0, %0" : : "r" (asid) : "memory");
> > > > > +}
> > > > > +
> > > > > #else /* CONFIG_MMU */
> > > > > #define local_flush_tlb_all() do { } while (0)
> > > > > #define local_flush_tlb_page(addr) do { } while (0)
> > > > > --
> > > > > 2.30.2
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > linux-riscv mailing list
> > > > > [email protected]
> > > > > http://lists.infradead.org/mailman/listinfo/linux-riscv
> > > >
> > > > Regards,
> > > > Anup
> >
> > Regards,
> > Anup


2021-03-16 15:43:27

by Alexandre Ghiti

Subject: Re: [PATCH] Insert SFENCE.VMA in function set_pte_at for RISCV

On 3/16/21 at 4:40 AM, Anup Patel wrote:
> On Tue, Mar 16, 2021 at 1:59 PM Andrew Waterman
> <[email protected]> wrote:
>>
>> On Tue, Mar 16, 2021 at 12:32 AM Anup Patel <[email protected]> wrote:
>>>
>>> On Tue, Mar 16, 2021 at 12:27 PM Jiuyang Liu <[email protected]> wrote:
>>>>
>>>>> As per my understanding, we don't need to explicitly invalidate local TLB
>>>>> in set_pte() or set_pte_at() because generic Linux page table management
>>>>> (<linux>/mm/*) will call the appropriate flush_tlb_xyz() function after page
>>>>> table updates.
>>>>
>>>> I witnessed this bug in our micro-architecture: set_pte instruction is
>>>> still in the store buffer, no functions are inserting SFENCE.VMA in
>>>> the stack below, so TLB cannot witness this modification.
>>>> Here is my call stack:
>>>> set_pte
>>>> set_pte_at
>>>> map_vm_area
>>>> __vmalloc_area_node
>>>> __vmalloc_node_range
>>>> __vmalloc_node
>>>> __vmalloc_node_flags
>>>> vzalloc
>>>> n_tty_open
>>>>

I can't find this call stack; what I find is (listed the other way around,
caller first):

n_tty_open
vzalloc
__vmalloc_node
__vmalloc_node_range
__vmalloc_area_node
map_kernel_range
-> map_kernel_range_noflush
flush_cache_vmap

Which suggests the problem is that we don't have the flush_cache_vmap
callback implemented: shouldn't we add the sfence.vma here? PowerPC does
something similar with the "ptesync" instruction (see below), which seems to
do the same as sfence.vma.

ptesync: "The ptesync instruction after the Store instruction ensures
that all searches of the Page Table that are performed after the ptesync
instruction completes will use the value stored"

>>>> I think this is an architecture specific code, so <linux>/mm/* should
>>>> not be modified.
>>>> And spec requires SFENCE.VMA to be inserted on each modification to
>>>> TLB. So I added code here.
>>>
>>> The generic linux/mm/* already calls the appropriate tlb_flush_xyz()
>>> function defined in arch/riscv/include/asm/tlbflush.h
>>>
>>> Better to have a write-barrier in set_pte().
>>>
>>>>
>>>>> Also, just local TLB flush is generally not sufficient because
>>>>> a lot of page tables will be used across multiple HARTs.
>>>>
>>>> Yes, this is the biggest issue, in RISC-V Volume 2, Privileged Spec v.
>>>> 20190608 page 67 gave a solution:
>>>
>>> This is not an issue with RISC-V privilege spec rather it is more about
>>> placing RISC-V fences at right locations.
>>>
>>>> Consequently, other harts must be notified separately when the
>>>> memory-management data structures have been modified. One approach is
>>>> to use
>>>> 1) a local data fence to ensure local writes are visible globally,
>>>> then 2) an interprocessor interrupt to the other thread,
>>>> then 3) a local SFENCE.VMA in the interrupt handler of the remote thread,
>>>> and finally 4) signal back to originating thread that operation is
>>>> complete. This is, of course, the RISC-V analog to a TLB shootdown.
>>>
>>> I would suggest trying approach#1.
>>>
>>> You can include "asm/barrier.h" here and use wmb() or __smp_wmb()
>>> in-place of local TLB flush.
>>
>> wmb() doesn't suffice to order older stores before younger page-table
>> walks, so that might hide the problem without actually fixing it.
>
> If we assume page-table walks as reads then mb() might be more
> suitable in this case ??
>
> ARM64 also has an explicit barrier in set_pte() implementation. They are
> doing "dsb(ishst); isb()" which is an inner-shareable store barrier followed
> by an instruction barrier.
>
>>
>> Based upon Jiuyang's description, it does sound plausible that we are
>> missing an SFENCE.VMA (or TLB shootdown) somewhere. But I don't
>> understand the situation well enough to know where that might be, or
>> what the best fix is.
>
> Yes, I agree but set_pte() doesn't seem to be the right place for TLB
> shootdown based on set_pte() implementations of other architectures.

I agree: "flushing" the TLB after every set_pte() would be very
costly, so it's better to do it once at the end of all the updates,
as in flush_cache_vmap :)
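
To make the cost argument concrete, here is a small user-space model (a
sketch only, not kernel code). Everything in it is hypothetical: struct
mmu_model, the model_* helpers and the fake PTE encoding are made up for
illustration, and the "flush" merely increments a counter where a real
implementation would execute sfence.vma.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_NPTES 8	/* size of the toy page table */

/* Toy model: a flat PTE array plus a counter of simulated TLB flushes. */
struct mmu_model {
	uint64_t ptes[MODEL_NPTES];
	unsigned long flushes;
};

/* Stands in for a local sfence.vma; here it only counts invocations. */
static inline void model_flush_tlb(struct mmu_model *m)
{
	m->flushes++;
}

/* Fake PTE encoding: index as the page frame number, bit 0 as valid. */
static inline uint64_t model_make_pte(size_t i)
{
	return ((uint64_t)i << 10) | 1;
}

/* Variant A: flush after every single PTE store. */
static inline void model_map_flush_each(struct mmu_model *m, size_t n)
{
	size_t i;

	if (n > MODEL_NPTES)
		n = MODEL_NPTES;	/* clamp to the table size */
	for (i = 0; i < n; i++) {
		m->ptes[i] = model_make_pte(i);
		model_flush_tlb(m);
	}
}

/* Variant B: store all PTEs first, then flush once at the end. */
static inline void model_map_flush_once(struct mmu_model *m, size_t n)
{
	size_t i;

	if (n > MODEL_NPTES)
		n = MODEL_NPTES;	/* clamp to the table size */
	for (i = 0; i < n; i++)
		m->ptes[i] = model_make_pte(i);
	model_flush_tlb(m);
}
```

Mapping eight entries costs eight simulated flushes with variant A and one
with variant B, while both leave the same entries in the table.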

Alex
