2022-09-28 00:53:35

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <[email protected]> wrote:
>
> On 2022/9/27 14:16, Anshuman Khandual wrote:
> > [...]
> >
> > On 9/21/22 14:13, Yicong Yang wrote:
> >> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >> +{
> >> + /* for small systems with small number of CPUs, TLB shootdown is cheap */
> >> + if (num_online_cpus() <= 4)
> >
> > It would be great to have some more inputs from others, whether 4 (which should
> > to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
> > is optimal for an wide range of arm64 platforms.
> >

I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
with 5,6,7
cores.
I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
this patch.

so it seems safe to have
if (num_online_cpus() < 8)

>
> Do you prefer this macro to be static or make it configurable through kconfig then
> different platforms can make choice based on their own situations? It maybe hard to
> test on all the arm64 platforms.

Maybe we can have this default enabled on machines with 8 and more cpus and
provide a tlbflush_batched = on or off to allow users enable or
disable it according
to their hardware and products. Similar example: rodata=on or off.

Hi Anshuman, Will, Catalin, Andrew,
what do you think about this approach?

BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
https://lore.kernel.org/lkml/[email protected]/

I do believe we need it based on the expensive cost of tlb shootdown in arm64
even by hardware broadcast.

>
> Thanks.
>
> >> + return false;> +
> >> +#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
> >> + if (unlikely(this_cpu_has_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
> >> + return false;
> >> +#endif
> >> +
> >> + return true;
> >> +}
> >> +
> >
> > [...]
> >
> > .
> >

Thanks
Barry


2022-10-27 10:54:04

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation



On 9/28/22 05:53, Barry Song wrote:
> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <[email protected]> wrote:
>>
>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>> [...]
>>>
>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>> +{
>>>> + /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>> + if (num_online_cpus() <= 4)
>>>
>>> It would be great to have some more inputs from others, whether 4 (which should
>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>> is optimal for an wide range of arm64 platforms.
>>>
>
> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
> with 5,6,7
> cores.
> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
> this patch.
>
> so it seems safe to have
> if (num_online_cpus() < 8)
>
>>
>> Do you prefer this macro to be static or make it configurable through kconfig then
>> different platforms can make choice based on their own situations? It maybe hard to
>> test on all the arm64 platforms.
>
> Maybe we can have this default enabled on machines with 8 and more cpus and
> provide a tlbflush_batched = on or off to allow users enable or
> disable it according
> to their hardware and products. Similar example: rodata=on or off.

No, sounds bit excessive. Kernel command line options should not be added
for every possible run time switch options.

>
> Hi Anshuman, Will, Catalin, Andrew,
> what do you think about this approach?
>
> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
> https://lore.kernel.org/lkml/[email protected]/
>
> I do believe we need it based on the expensive cost of tlb shootdown in arm64
> even by hardware broadcast.

Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
with CONFIG_EXPERT and for num_online_cpus() > 8 ?

2022-10-27 14:39:28

by Punit Agrawal

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation


[ Apologies for chiming in late in the conversation ]

Anshuman Khandual <[email protected]> writes:

> On 9/28/22 05:53, Barry Song wrote:
>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <[email protected]> wrote:
>>>
>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>> [...]
>>>>
>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>> +{
>>>>> + /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>>> + if (num_online_cpus() <= 4)
>>>>
>>>> It would be great to have some more inputs from others, whether 4 (which should
>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>>> is optimal for an wide range of arm64 platforms.
>>>>
>>
>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>> with 5,6,7
>> cores.
>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>> this patch.
>>
>> so it seems safe to have
>> if (num_online_cpus() < 8)
>>
>>>
>>> Do you prefer this macro to be static or make it configurable through kconfig then
>>> different platforms can make choice based on their own situations? It maybe hard to
>>> test on all the arm64 platforms.
>>
>> Maybe we can have this default enabled on machines with 8 and more cpus and
>> provide a tlbflush_batched = on or off to allow users enable or
>> disable it according
>> to their hardware and products. Similar example: rodata=on or off.
>
> No, sounds bit excessive. Kernel command line options should not be added
> for every possible run time switch options.
>
>>
>> Hi Anshuman, Will, Catalin, Andrew,
>> what do you think about this approach?
>>
>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>> even by hardware broadcast.
>
> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
> with CONFIG_EXPERT and for num_online_cpus() > 8 ?

When running the test program in the commit in a VM, I saw benefits from
the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
ptep_clear_flush() went from ~1% in the unpatched version to not showing
up.

Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
there any overhead? I am wondering what are the downsides of enabling
the config by default.

Thanks,
Punit

2022-10-27 22:22:11

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal
<[email protected]> wrote:
>
>
> [ Apologies for chiming in late in the conversation ]
>
> Anshuman Khandual <[email protected]> writes:
>
> > On 9/28/22 05:53, Barry Song wrote:
> >> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <[email protected]> wrote:
> >>>
> >>> On 2022/9/27 14:16, Anshuman Khandual wrote:
> >>>> [...]
> >>>>
> >>>> On 9/21/22 14:13, Yicong Yang wrote:
> >>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>>>> +{
> >>>>> + /* for small systems with small number of CPUs, TLB shootdown is cheap */
> >>>>> + if (num_online_cpus() <= 4)
> >>>>
> >>>> It would be great to have some more inputs from others, whether 4 (which should
> >>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
> >>>> is optimal for an wide range of arm64 platforms.
> >>>>
> >>
> >> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
> >> with 5,6,7
> >> cores.
> >> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
> >> this patch.
> >>
> >> so it seems safe to have
> >> if (num_online_cpus() < 8)
> >>
> >>>
> >>> Do you prefer this macro to be static or make it configurable through kconfig then
> >>> different platforms can make choice based on their own situations? It maybe hard to
> >>> test on all the arm64 platforms.
> >>
> >> Maybe we can have this default enabled on machines with 8 and more cpus and
> >> provide a tlbflush_batched = on or off to allow users enable or
> >> disable it according
> >> to their hardware and products. Similar example: rodata=on or off.
> >
> > No, sounds bit excessive. Kernel command line options should not be added
> > for every possible run time switch options.
> >
> >>
> >> Hi Anshuman, Will, Catalin, Andrew,
> >> what do you think about this approach?
> >>
> >> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
> >> https://lore.kernel.org/lkml/[email protected]/
> >>
> >> I do believe we need it based on the expensive cost of tlb shootdown in arm64
> >> even by hardware broadcast.
> >
> > Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
> > with CONFIG_EXPERT and for num_online_cpus() > 8 ?
>
> When running the test program in the commit in a VM, I saw benefits from
> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
> ptep_clear_flush() went from ~1% in the unpatched version to not showing
> up.
>
> Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
> there any overhead? I am wondering what are the downsides of enabling
> the config by default.

As we are deferring tlb flush, but sometimes while we are modifying the vma
which are deferred, we need to do a sync by flush_tlb_batched_pending() in
mprotect() , madvise() to make sure they can see the flushed result. if nobody
is doing mprotect(), madvise() etc in the deferred period, the overhead is zero.

>
> Thanks,
> Punit

Thanks
Barry

2022-10-27 22:42:15

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

On Thu, Oct 27, 2022 at 11:42 PM Anshuman Khandual
<[email protected]> wrote:
>
>
>
> On 9/28/22 05:53, Barry Song wrote:
> > On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <[email protected]> wrote:
> >>
> >> On 2022/9/27 14:16, Anshuman Khandual wrote:
> >>> [...]
> >>>
> >>> On 9/21/22 14:13, Yicong Yang wrote:
> >>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>>> +{
> >>>> + /* for small systems with small number of CPUs, TLB shootdown is cheap */
> >>>> + if (num_online_cpus() <= 4)
> >>>
> >>> It would be great to have some more inputs from others, whether 4 (which should
> >>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
> >>> is optimal for an wide range of arm64 platforms.
> >>>
> >
> > I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
> > with 5,6,7
> > cores.
> > I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
> > this patch.
> >
> > so it seems safe to have
> > if (num_online_cpus() < 8)
> >
> >>
> >> Do you prefer this macro to be static or make it configurable through kconfig then
> >> different platforms can make choice based on their own situations? It maybe hard to
> >> test on all the arm64 platforms.
> >
> > Maybe we can have this default enabled on machines with 8 and more cpus and
> > provide a tlbflush_batched = on or off to allow users enable or
> > disable it according
> > to their hardware and products. Similar example: rodata=on or off.
>
> No, sounds bit excessive. Kernel command line options should not be added
> for every possible run time switch options.
>
> >
> > Hi Anshuman, Will, Catalin, Andrew,
> > what do you think about this approach?
> >
> > BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > I do believe we need it based on the expensive cost of tlb shootdown in arm64
> > even by hardware broadcast.
>
> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
> with CONFIG_EXPERT and for num_online_cpus() > 8 ?

Sounds good to me. It is a good start to bring up tlb batched flush in
ARM64. Later on, we
might want to see it in both memory reclamation and migration.

Thanks
Barry

2022-10-28 02:13:14

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation



On 10/28/22 03:37, Barry Song wrote:
> On Thu, Oct 27, 2022 at 11:42 PM Anshuman Khandual
> <[email protected]> wrote:
>>
>>
>>
>> On 9/28/22 05:53, Barry Song wrote:
>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <[email protected]> wrote:
>>>>
>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>> [...]
>>>>>
>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>> +{
>>>>>> + /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>>>> + if (num_online_cpus() <= 4)
>>>>>
>>>>> It would be great to have some more inputs from others, whether 4 (which should
>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>>>> is optimal for an wide range of arm64 platforms.
>>>>>
>>>
>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>> with 5,6,7
>>> cores.
>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>>> this patch.
>>>
>>> so it seems safe to have
>>> if (num_online_cpus() < 8)
>>>
>>>>
>>>> Do you prefer this macro to be static or make it configurable through kconfig then
>>>> different platforms can make choice based on their own situations? It maybe hard to
>>>> test on all the arm64 platforms.
>>>
>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>>> provide a tlbflush_batched = on or off to allow users enable or
>>> disable it according
>>> to their hardware and products. Similar example: rodata=on or off.
>>
>> No, sounds bit excessive. Kernel command line options should not be added
>> for every possible run time switch options.
>>
>>>
>>> Hi Anshuman, Will, Catalin, Andrew,
>>> what do you think about this approach?
>>>
>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>>> even by hardware broadcast.
>>
>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
>> with CONFIG_EXPERT and for num_online_cpus() > 8 ?
>
> Sounds good to me. It is a good start to bring up tlb batched flush in
> ARM64. Later on, we
> might want to see it in both memory reclamation and migration.

Right, that is the idea, CONFIG_EXPERT gives an way to test this out for some time
on various platforms, and later it can be dropped off. Regarding num_online_cpus()
= '8' as the threshold which would potentially give benefit of batched TLB should
be defined as a macro e.g NR_CPUS_FOR_BATCHED_TLB or internal (non user selectable)
config , with a proper in-code comment, explaining the rationale.

2022-10-28 02:34:44

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation



On 10/28/22 03:25, Barry Song wrote:
> On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal
> <[email protected]> wrote:
>>
>> [ Apologies for chiming in late in the conversation ]
>>
>> Anshuman Khandual <[email protected]> writes:
>>
>>> On 9/28/22 05:53, Barry Song wrote:
>>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <[email protected]> wrote:
>>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>>> [...]
>>>>>>
>>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>>> +{
>>>>>>> + /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>>>>> + if (num_online_cpus() <= 4)
>>>>>> It would be great to have some more inputs from others, whether 4 (which should
>>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>>>>> is optimal for an wide range of arm64 platforms.
>>>>>>
>>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>>> with 5,6,7
>>>> cores.
>>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>>>> this patch.
>>>>
>>>> so it seems safe to have
>>>> if (num_online_cpus() < 8)
>>>>
>>>>> Do you prefer this macro to be static or make it configurable through kconfig then
>>>>> different platforms can make choice based on their own situations? It maybe hard to
>>>>> test on all the arm64 platforms.
>>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>>>> provide a tlbflush_batched = on or off to allow users enable or
>>>> disable it according
>>>> to their hardware and products. Similar example: rodata=on or off.
>>> No, sounds bit excessive. Kernel command line options should not be added
>>> for every possible run time switch options.
>>>
>>>> Hi Anshuman, Will, Catalin, Andrew,
>>>> what do you think about this approach?
>>>>
>>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>>>> even by hardware broadcast.
>>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
>>> with CONFIG_EXPERT and for num_online_cpus() > 8 ?
>> When running the test program in the commit in a VM, I saw benefits from
>> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>> up.
>>
>> Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
>> there any overhead? I am wondering what are the downsides of enabling
>> the config by default.
> As we are deferring tlb flush, but sometimes while we are modifying the vma
> which are deferred, we need to do a sync by flush_tlb_batched_pending() in
> mprotect() , madvise() to make sure they can see the flushed result. if nobody
> is doing mprotect(), madvise() etc in the deferred period, the overhead is zero.

Right, it is difficult to justify this overhead for smaller systems,
which for sure would not benefit from this batched TLB framework.

2022-10-28 13:47:02

by Punit Agrawal

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

Anshuman Khandual <[email protected]> writes:

> On 10/28/22 03:25, Barry Song wrote:
>> On Fri, Oct 28, 2022 at 3:19 AM Punit Agrawal
>> <[email protected]> wrote:
>>>
>>> [ Apologies for chiming in late in the conversation ]
>>>
>>> Anshuman Khandual <[email protected]> writes:
>>>
>>>> On 9/28/22 05:53, Barry Song wrote:
>>>>> On Tue, Sep 27, 2022 at 10:15 PM Yicong Yang <[email protected]> wrote:
>>>>>> On 2022/9/27 14:16, Anshuman Khandual wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>> On 9/21/22 14:13, Yicong Yang wrote:
>>>>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>>>>> +{
>>>>>>>> + /* for small systems with small number of CPUs, TLB shootdown is cheap */
>>>>>>>> + if (num_online_cpus() <= 4)
>>>>>>> It would be great to have some more inputs from others, whether 4 (which should
>>>>>>> to be codified into a macro e.g ARM64_NR_CPU_DEFERRED_TLB, or something similar)
>>>>>>> is optimal for an wide range of arm64 platforms.
>>>>>>>
>>>>> I have tested it on a 4-cpus and 8-cpus machine. but i have no machine
>>>>> with 5,6,7
>>>>> cores.
>>>>> I saw improvement on 8-cpus machines and I found 4-cpus machines don't need
>>>>> this patch.
>>>>>
>>>>> so it seems safe to have
>>>>> if (num_online_cpus() < 8)
>>>>>
>>>>>> Do you prefer this macro to be static or make it configurable through kconfig then
>>>>>> different platforms can make choice based on their own situations? It maybe hard to
>>>>>> test on all the arm64 platforms.
>>>>> Maybe we can have this default enabled on machines with 8 and more cpus and
>>>>> provide a tlbflush_batched = on or off to allow users enable or
>>>>> disable it according
>>>>> to their hardware and products. Similar example: rodata=on or off.
>>>> No, sounds bit excessive. Kernel command line options should not be added
>>>> for every possible run time switch options.
>>>>
>>>>> Hi Anshuman, Will, Catalin, Andrew,
>>>>> what do you think about this approach?
>>>>>
>>>>> BTW, haoxin mentioned another important user scenarios for tlb bach on arm64:
>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>
>>>>> I do believe we need it based on the expensive cost of tlb shootdown in arm64
>>>>> even by hardware broadcast.
>>>> Alright, for now could we enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH selectively
>>>> with CONFIG_EXPERT and for num_online_cpus() > 8 ?
>>> When running the test program in the commit in a VM, I saw benefits from
>>> the patches at all sizes from 2, 4, 8, 32 vcpus. On the test machine,
>>> ptep_clear_flush() went from ~1% in the unpatched version to not showing
>>> up.
>>>
>>> Yicong mentioned that he didn't see any benefit for <= 4 CPUs but is
>>> there any overhead? I am wondering what are the downsides of enabling
>>> the config by default.
>> As we are deferring tlb flush, but sometimes while we are modifying the vma
>> which are deferred, we need to do a sync by flush_tlb_batched_pending() in
>> mprotect() , madvise() to make sure they can see the flushed result. if nobody
>> is doing mprotect(), madvise() etc in the deferred period, the overhead is zero.
>
> Right, it is difficult to justify this overhead for smaller systems,
> which for sure would not benefit from this batched TLB framework.

Thank you for the pointers to the overhead.

Having looked at this more closely, I also see that
flush_tlb_batched_pending() discards the entire mm vs just flushing the
page being unmapped (as is done with ptep_clear_flush()).