2023-07-12 06:07:21

by Yin, Fengwei

Subject: [RFC PATCH v2 0/3] support large folio for mlock

Yu mentioned at [1] that mlock() can't be applied to large folios.

I learned the related code and here is my understanding:
- For RLIMIT_MEMLOCK, there is no problem, because the RLIMIT_MEMLOCK
accounting is not tied to the underlying pages. Whether the underlying
pages are mlocked or munlocked doesn't affect the RLIMIT_MEMLOCK
accounting, which is always correct.

- For keeping the pages in RAM, there is no problem either. At least,
during try_to_unmap_one(), once the VMA is detected to have the
VM_LOCKED bit set in vm_flags, the folio is kept regardless of
whether it is mlocked or not.
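
Roughly, the relevant check looks like this (a simplified sketch, not a
verbatim quote of mm/rmap.c):

	if (!(flags & TTU_IGNORE_MLOCK) &&
	    (vma->vm_flags & VM_LOCKED)) {
		/* Restore the mlock which got missed */
		...
		page_vma_mapped_walk_done(&pvmw);
		ret = false;	/* don't unmap; the folio stays in RAM */
		break;
	}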

So mlock for large folios is functional. But it's not optimal, because
page reclaim still needs to scan these large folios and may split
them.

This series classifies large folios for mlock into two types:
- The large folio is fully within a VM_LOCKED VMA range
- The large folio crosses a VM_LOCKED VMA boundary

For the first type, we mlock the large folio so page reclaim will skip it.
For the second type, we don't mlock the large folio. It's allowed to be
picked by page reclaim and split, so the pages not in the VM_LOCKED VMA
range can be reclaimed/released.
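
In pseudo-C, the intended handling is roughly the following (a minimal
sketch only; the real helpers are introduced by the patches below):

	if (folio_within_vma(folio, vma)) {
		/* first type: fully inside the VM_LOCKED VMA */
		mlock_folio(folio);	/* page reclaim will skip it */
	} else {
		/*
		 * second type: crosses the VMA boundary. Leave it
		 * un-mlocked so page reclaim may split it and reclaim
		 * the pages outside the VM_LOCKED VMA range.
		 */
	}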

patch1 introduces an API to check whether a large folio is in a VMA range.
patch2 makes page reclaim/mlock_vma_folio()/munlock_vma_folio() support
large folio mlock/munlock.
patch3 makes the mlock/munlock syscalls support large folios.

testing done:
- kernel selftests: no extra failures introduced

Matthew commented on v1 that the large folio should be split if it
crosses VMA boundaries. But there is no obviously correct way to
handle split failure, and it's a common issue for mprotect, mlock,
mremap, munmap, etc.

So I keep the v1 behavior (don't split the folio if it crosses VMA
boundaries) in v2.

[1] https://lore.kernel.org/linux-mm/CAOUHufbtNPkdktjt_5qM45GegVO-rCFOMkSh0HQminQ12zsV8Q@mail.gmail.com/

Changes from v1:
patch1:
- Add new function folio_within_vma() based on folio_in_range()
per Yu's suggestion

patch2:
- Update folio_referenced_one() to skip the entries which are
out of the VM_LOCKED VMA range if the large folio crosses VMA
boundaries, per Yu's suggestion

patch3:
- Simplify the changes in mlock_pte_range() by introducing two
helper functions, should_mlock_folio() and get_folio_mlock_step(),
per Yu's suggestion


Yin Fengwei (3):
mm: add functions folio_in_range() and folio_within_vma()
mm: handle large folio when large folio in VM_LOCKED VMA range
mm: mlock: update mlock_pte_range to handle large folio

mm/internal.h | 43 +++++++++++++++++++--
mm/mlock.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++---
mm/rmap.c | 34 +++++++++++++----
3 files changed, 166 insertions(+), 15 deletions(-)

--
2.39.2



2023-07-12 06:07:30

by Yin, Fengwei

Subject: [RFC PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma()

folio_in_range() will be used to check whether a folio is mapped to a
specific VMA and whether the mapping address of the folio is in the
given range.

Also add a helper function folio_within_vma(), based on folio_in_range(),
to check whether a folio is within the range of a VMA.
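
A minimal usage sketch (the call sites here are illustrative only; the
real callers are added in patch2 and patch3):

	/* Is the folio mapped entirely inside [start, end) of this VMA? */
	if (folio_in_range(folio, vma, start, end))
		/* the whole folio is covered by the range */;

	/* Most callers only care about the whole VMA */
	if (folio_within_vma(folio, vma))
		mlock_folio(folio);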

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/internal.h | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index 483add0bfb289..c7dd15d8de3ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -585,6 +585,38 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
bool write, int *locked);
extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
unsigned long bytes);
+
+static inline bool
+folio_in_range(struct folio *folio, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ pgoff_t pgoff, addr;
+ unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+ VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
+ if (start < vma->vm_start)
+ start = vma->vm_start;
+
+ if (end > vma->vm_end)
+ end = vma->vm_end;
+
+ pgoff = folio_pgoff(folio);
+
+ /* if folio start address is not in vma range */
+ if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
+ return false;
+
+ addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+
+ return ((addr >= start) && (addr + folio_size(folio) <= end));
+}
+
+static inline bool
+folio_within_vma(struct folio *folio, struct vm_area_struct *vma)
+{
+ return folio_in_range(folio, vma, vma->vm_start, vma->vm_end);
+}
+
/*
* mlock_vma_folio() and munlock_vma_folio():
* should be called with vma's mmap_lock held for read or write,
--
2.39.2


2023-07-12 06:07:34

by Yin, Fengwei

Subject: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

The current kernel only mlocks base-size folios during the mlock
syscall. Add large folio support with the following rules:
- Only mlock a large folio when it's fully within the VM_LOCKED VMA
range.

- If there is a COW folio, mlock the COW folio as the COW folio is
also in the VM_LOCKED VMA range.

- munlock applies to a large folio which is within the VMA range or
crosses the VMA boundary.

The last rule handles the case where the large folio is mlocked, the
VMA is later split in the middle of the large folio, and the large
folio now crosses a VMA boundary.
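
A hypothetical userspace sequence showing that case (the sizes and the
mprotect() split point are made up for illustration):

	#include <sys/mman.h>

	char *p = mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	mlock(p, 4UL << 20);		/* large folios here get mlocked */
	/* split the VMA, possibly in the middle of a large folio */
	mprotect(p + (1UL << 20), 4096, PROT_READ);
	munlock(p, 4UL << 20);		/* must still munlock such folios */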

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 99 insertions(+), 5 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 0a0c996c5c214..f49e079066870 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
local_unlock(&mlock_fbatch.lock);
}

+static inline bool should_mlock_folio(struct folio *folio,
+ struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_LOCKED)
+ return (!folio_test_large(folio) ||
+ folio_within_vma(folio, vma));
+
+ /*
+ * For unlock, allow munlock large folio which is partially
+ * mapped to VMA. As it's possible that large folio is
+ * mlocked and VMA is split later.
+ *
+ * During memory pressure, such kind of large folio can
+ * be split. And the pages are not in VM_LOCKed VMA
+ * can be reclaimed.
+ */
+
+ return true;
+}
+
+static inline unsigned int get_folio_mlock_step(struct folio *folio,
+ pte_t pte, unsigned long addr, unsigned long end)
+{
+ unsigned int nr;
+
+ nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
+ return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
+}
+
+void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
+ pte_t *pte, unsigned long addr, unsigned int nr)
+{
+ struct folio *cow_folio;
+ unsigned int step = 1;
+
+ mlock_folio(folio);
+ if (nr == 1)
+ return;
+
+ for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
+ pte_t ptent;
+
+ step = 1;
+ ptent = ptep_get(pte);
+
+ if (!pte_present(ptent))
+ continue;
+
+ cow_folio = vm_normal_folio(vma, addr, ptent);
+ if (!cow_folio || cow_folio == folio) {
+ continue;
+ }
+
+ mlock_folio(cow_folio);
+ step = get_folio_mlock_step(folio, ptent,
+ addr, addr + (nr << PAGE_SHIFT));
+ }
+}
+
+void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
+ pte_t *pte, unsigned long addr, unsigned int nr)
+{
+ struct folio *cow_folio;
+ unsigned int step = 1;
+
+ munlock_folio(folio);
+ if (nr == 1)
+ return;
+
+ for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
+ pte_t ptent;
+
+ step = 1;
+ ptent = ptep_get(pte);
+
+ if (!pte_present(ptent))
+ continue;
+
+ cow_folio = vm_normal_folio(vma, addr, ptent);
+ if (!cow_folio || cow_folio == folio) {
+ continue;
+ }
+
+ munlock_folio(cow_folio);
+ step = get_folio_mlock_step(folio, ptent,
+ addr, addr + (nr << PAGE_SHIFT));
+ }
+}
+
static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)

@@ -314,6 +403,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
pte_t *start_pte, *pte;
pte_t ptent;
struct folio *folio;
+ unsigned int step = 1;

ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
@@ -329,24 +419,28 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
goto out;
}

- start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ pte = start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
if (!start_pte) {
walk->action = ACTION_AGAIN;
return 0;
}
- for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
+
+ for (; addr != end; pte += step, addr += (step << PAGE_SHIFT)) {
+ step = 1;
ptent = ptep_get(pte);
if (!pte_present(ptent))
continue;
folio = vm_normal_folio(vma, addr, ptent);
if (!folio || folio_is_zone_device(folio))
continue;
- if (folio_test_large(folio))
+ if (!should_mlock_folio(folio, vma))
continue;
+
+ step = get_folio_mlock_step(folio, ptent, addr, end);
if (vma->vm_flags & VM_LOCKED)
- mlock_folio(folio);
+ mlock_folio_range(folio, vma, pte, addr, step);
else
- munlock_folio(folio);
+ munlock_folio_range(folio, vma, pte, addr, step);
}
pte_unmap(start_pte);
out:
--
2.39.2


2023-07-12 06:22:34

by Yin, Fengwei

Subject: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range

If a large folio is within the range of a VM_LOCKED VMA, it should be
mlocked to avoid being picked by page reclaim, which may split the
large folio and then mlock each page again.

Mlock this kind of large folio to prevent it from being picked by
page reclaim.

For a large folio which crosses the boundary of a VM_LOCKED VMA,
we'd better not mlock it. Then, if the system is under memory
pressure, this kind of large folio will be split and the pages
outside the VM_LOCKED VMA can be reclaimed.

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/internal.h | 11 ++++++++---
mm/rmap.c | 34 +++++++++++++++++++++++++++-------
2 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c7dd15d8de3ef..776141de2797a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
* still be set while VM_SPECIAL bits are added: so ignore it then.
*/
if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
- (compound || !folio_test_large(folio)))
+ (compound || !folio_test_large(folio) ||
+ folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
mlock_folio(folio);
}

@@ -651,8 +652,12 @@ void munlock_folio(struct folio *folio);
static inline void munlock_vma_folio(struct folio *folio,
struct vm_area_struct *vma, bool compound)
{
- if (unlikely(vma->vm_flags & VM_LOCKED) &&
- (compound || !folio_test_large(folio)))
+ /*
+ * To handle the case that a mlocked large folio is unmapped from VMA
+ * piece by piece, allow munlock the large folio which is partially
+ * mapped to VMA.
+ */
+ if (unlikely(vma->vm_flags & VM_LOCKED))
munlock_folio(folio);
}

diff --git a/mm/rmap.c b/mm/rmap.c
index 2668f5ea35342..455f415d8d9ca 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -803,6 +803,14 @@ struct folio_referenced_arg {
unsigned long vm_flags;
struct mem_cgroup *memcg;
};
+
+static inline bool should_restore_mlock(struct folio *folio,
+ struct vm_area_struct *vma, bool pmd_mapped)
+{
+ return !folio_test_large(folio) ||
+ pmd_mapped || folio_within_vma(folio, vma);
+}
+
/*
* arg: folio_referenced_arg will be passed
*/
@@ -816,13 +824,25 @@ static bool folio_referenced_one(struct folio *folio,
while (page_vma_mapped_walk(&pvmw)) {
address = pvmw.address;

- if ((vma->vm_flags & VM_LOCKED) &&
- (!folio_test_large(folio) || !pvmw.pte)) {
- /* Restore the mlock which got missed */
- mlock_vma_folio(folio, vma, !pvmw.pte);
- page_vma_mapped_walk_done(&pvmw);
- pra->vm_flags |= VM_LOCKED;
- return false; /* To break the loop */
+ if (vma->vm_flags & VM_LOCKED) {
+ if (should_restore_mlock(folio, vma, !pvmw.pte)) {
+ /* Restore the mlock which got missed */
+ mlock_vma_folio(folio, vma, !pvmw.pte);
+ page_vma_mapped_walk_done(&pvmw);
+ pra->vm_flags |= VM_LOCKED;
+ return false; /* To break the loop */
+ } else {
+ /*
+ * For large folio cross VMA boundaries, it's
+ * expected to be picked by page reclaim. But
+ * should skip reference of pages which are in
+ * the range of VM_LOCKED vma. As page reclaim
+ * should just count the reference of pages out
+ * the range of VM_LOCKED vma.
+ */
+ pra->mapcount--;
+ continue;
+ }
}

if (pvmw.pte) {
--
2.39.2


2023-07-12 06:30:57

by Yu Zhao

Subject: Re: [RFC PATCH v2 1/3] mm: add functions folio_in_range() and folio_within_vma()

On Wed, Jul 12, 2023 at 12:01 AM Yin Fengwei <[email protected]> wrote:
>
> It will be used to check whether the folio is mapped to specific
> VMA and whether the mapping address of folio is in the range.
>
> Also a helper function folio_within_vma() to check whether folio
> is in the range of vma based on folio_in_range().
>
> Signed-off-by: Yin Fengwei <[email protected]>

Reviewed-by: Yu Zhao <[email protected]>

2023-07-12 06:32:35

by Yu Zhao

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range

On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>
> If large folio is in the range of VM_LOCKED VMA, it should be
> mlocked to avoid being picked by page reclaim. Which may split
> the large folio and then mlock each pages again.
>
> Mlock this kind of large folio to prevent them being picked by
> page reclaim.
>
> For the large folio which cross the boundary of VM_LOCKED VMA,
> we'd better not to mlock it. So if the system is under memory
> pressure, this kind of large folio will be split and the pages
> ouf of VM_LOCKED VMA can be reclaimed.
>
> Signed-off-by: Yin Fengwei <[email protected]>
> ---
> mm/internal.h | 11 ++++++++---
> mm/rmap.c | 34 +++++++++++++++++++++++++++-------
> 2 files changed, 35 insertions(+), 10 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c7dd15d8de3ef..776141de2797a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
> * still be set while VM_SPECIAL bits are added: so ignore it then.
> */
> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> - (compound || !folio_test_large(folio)))
> + (compound || !folio_test_large(folio) ||
> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
> mlock_folio(folio);
> }

This can be simplified:
1. remove the compound parameter
2. make the if
if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
folio_within_vma())
mlock_folio(folio);

> @@ -651,8 +652,12 @@ void munlock_folio(struct folio *folio);
> static inline void munlock_vma_folio(struct folio *folio,
> struct vm_area_struct *vma, bool compound)

Remove the compound parameter here too.

> {
> - if (unlikely(vma->vm_flags & VM_LOCKED) &&
> - (compound || !folio_test_large(folio)))
> + /*
> + * To handle the case that a mlocked large folio is unmapped from VMA
> + * piece by piece, allow munlock the large folio which is partially
> + * mapped to VMA.
> + */
> + if (unlikely(vma->vm_flags & VM_LOCKED))
> munlock_folio(folio);
> }
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2668f5ea35342..455f415d8d9ca 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -803,6 +803,14 @@ struct folio_referenced_arg {
> unsigned long vm_flags;
> struct mem_cgroup *memcg;
> };
> +
> +static inline bool should_restore_mlock(struct folio *folio,
> + struct vm_area_struct *vma, bool pmd_mapped)
> +{
> + return !folio_test_large(folio) ||
> + pmd_mapped || folio_within_vma(folio, vma);
> +}

This is just folio_within_vma() :)

> /*
> * arg: folio_referenced_arg will be passed
> */
> @@ -816,13 +824,25 @@ static bool folio_referenced_one(struct folio *folio,
> while (page_vma_mapped_walk(&pvmw)) {
> address = pvmw.address;
>
> - if ((vma->vm_flags & VM_LOCKED) &&
> - (!folio_test_large(folio) || !pvmw.pte)) {
> - /* Restore the mlock which got missed */
> - mlock_vma_folio(folio, vma, !pvmw.pte);
> - page_vma_mapped_walk_done(&pvmw);
> - pra->vm_flags |= VM_LOCKED;
> - return false; /* To break the loop */
> + if (vma->vm_flags & VM_LOCKED) {
> + if (should_restore_mlock(folio, vma, !pvmw.pte)) {
> + /* Restore the mlock which got missed */
> + mlock_vma_folio(folio, vma, !pvmw.pte);
> + page_vma_mapped_walk_done(&pvmw);
> + pra->vm_flags |= VM_LOCKED;
> + return false; /* To break the loop */
> + } else {

There is no need for "else", or just

if (!folio_within_vma())
goto dec_pra_mapcount;

> + /*
> + * For large folio cross VMA boundaries, it's
> + * expected to be picked by page reclaim. But
> + * should skip reference of pages which are in
> + * the range of VM_LOCKED vma. As page reclaim
> + * should just count the reference of pages out
> + * the range of VM_LOCKED vma.
> + */
> + pra->mapcount--;
> + continue;
> + }
> }

2023-07-12 06:51:25

by Yu Zhao

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>
> Current kernel only lock base size folio during mlock syscall.
> Add large folio support with following rules:
> - Only mlock large folio when it's in VM_LOCKED VMA range
>
> - If there is cow folio, mlock the cow folio as cow folio
> is also in VM_LOCKED VMA range.
>
> - munlock will apply to the large folio which is in VMA range
> or cross the VMA boundary.
>
> The last rule is used to handle the case that the large folio is
> mlocked, later the VMA is split in the middle of large folio
> and this large folio become cross VMA boundary.
>
> Signed-off-by: Yin Fengwei <[email protected]>
> ---
> mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 99 insertions(+), 5 deletions(-)
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0a0c996c5c214..f49e079066870 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
> local_unlock(&mlock_fbatch.lock);
> }
>
> +static inline bool should_mlock_folio(struct folio *folio,
> + struct vm_area_struct *vma)
> +{
> + if (vma->vm_flags & VM_LOCKED)
> + return (!folio_test_large(folio) ||
> + folio_within_vma(folio, vma));
> +
> + /*
> + * For unlock, allow munlock large folio which is partially
> + * mapped to VMA. As it's possible that large folio is
> + * mlocked and VMA is split later.
> + *
> + * During memory pressure, such kind of large folio can
> + * be split. And the pages are not in VM_LOCKed VMA
> + * can be reclaimed.
> + */
> +
> + return true;

Looks good, or just

should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
{
return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
}

> +}
> +
> +static inline unsigned int get_folio_mlock_step(struct folio *folio,
> + pte_t pte, unsigned long addr, unsigned long end)
> +{
> + unsigned int nr;
> +
> + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
> + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
> +}
> +
> +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> + pte_t *pte, unsigned long addr, unsigned int nr)
> +{
> + struct folio *cow_folio;
> + unsigned int step = 1;
> +
> + mlock_folio(folio);
> + if (nr == 1)
> + return;
> +
> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> + pte_t ptent;
> +
> + step = 1;
> + ptent = ptep_get(pte);
> +
> + if (!pte_present(ptent))
> + continue;
> +
> + cow_folio = vm_normal_folio(vma, addr, ptent);
> + if (!cow_folio || cow_folio == folio) {
> + continue;
> + }
> +
> + mlock_folio(cow_folio);
> + step = get_folio_mlock_step(folio, ptent,
> + addr, addr + (nr << PAGE_SHIFT));
> + }
> +}
> +
> +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> + pte_t *pte, unsigned long addr, unsigned int nr)
> +{
> + struct folio *cow_folio;
> + unsigned int step = 1;
> +
> + munlock_folio(folio);
> + if (nr == 1)
> + return;
> +
> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> + pte_t ptent;
> +
> + step = 1;
> + ptent = ptep_get(pte);
> +
> + if (!pte_present(ptent))
> + continue;
> +
> + cow_folio = vm_normal_folio(vma, addr, ptent);
> + if (!cow_folio || cow_folio == folio) {
> + continue;
> + }
> +
> + munlock_folio(cow_folio);
> + step = get_folio_mlock_step(folio, ptent,
> + addr, addr + (nr << PAGE_SHIFT));
> + }
> +}

I'll finish the above later.

> static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> unsigned long end, struct mm_walk *walk)
>
> @@ -314,6 +403,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> pte_t *start_pte, *pte;
> pte_t ptent;
> struct folio *folio;
> + unsigned int step = 1;
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> if (ptl) {
> @@ -329,24 +419,28 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
> goto out;
> }
>
> - start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> + pte = start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> if (!start_pte) {
> walk->action = ACTION_AGAIN;
> return 0;
> }
> - for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
> +
> + for (; addr != end; pte += step, addr += (step << PAGE_SHIFT)) {
> + step = 1;
> ptent = ptep_get(pte);
> if (!pte_present(ptent))
> continue;
> folio = vm_normal_folio(vma, addr, ptent);
> if (!folio || folio_is_zone_device(folio))
> continue;
> - if (folio_test_large(folio))
> + if (!should_mlock_folio(folio, vma))
> continue;
> +
> + step = get_folio_mlock_step(folio, ptent, addr, end);
> if (vma->vm_flags & VM_LOCKED)
> - mlock_folio(folio);
> + mlock_folio_range(folio, vma, pte, addr, step);
> else
> - munlock_folio(folio);
> + munlock_folio_range(folio, vma, pte, addr, step);
> }
> pte_unmap(start_pte);
> out:

Looks good.

2023-07-12 07:24:32

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range



On 7/12/23 14:23, Yu Zhao wrote:
> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>
>> If large folio is in the range of VM_LOCKED VMA, it should be
>> mlocked to avoid being picked by page reclaim. Which may split
>> the large folio and then mlock each pages again.
>>
>> Mlock this kind of large folio to prevent them being picked by
>> page reclaim.
>>
>> For the large folio which cross the boundary of VM_LOCKED VMA,
>> we'd better not to mlock it. So if the system is under memory
>> pressure, this kind of large folio will be split and the pages
>> ouf of VM_LOCKED VMA can be reclaimed.
>>
>> Signed-off-by: Yin Fengwei <[email protected]>
>> ---
>> mm/internal.h | 11 ++++++++---
>> mm/rmap.c | 34 +++++++++++++++++++++++++++-------
>> 2 files changed, 35 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index c7dd15d8de3ef..776141de2797a 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
>> * still be set while VM_SPECIAL bits are added: so ignore it then.
>> */
>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
>> - (compound || !folio_test_large(folio)))
>> + (compound || !folio_test_large(folio) ||
>> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
>> mlock_folio(folio);
>> }
>
> This can be simplified:
> 1. remove the compound parameter
Yes. There is no difference here between PMD mappings of THPs and PTE mappings
of THPs if the only condition that needs checking is whether the folio is
within the VMA range or not.

But let me add Hugh for confirmation.


> 2. make the if
> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> folio_within_vma())
> mlock_folio(folio);
!folio_test_large(folio) was kept here on purpose. For a normal 4K page, there
is no need to call folio_within_vma(), which is heavy for a normal 4K page.


>
>> @@ -651,8 +652,12 @@ void munlock_folio(struct folio *folio);
>> static inline void munlock_vma_folio(struct folio *folio,
>> struct vm_area_struct *vma, bool compound)
>
> Remove the compound parameter here too.
>
>> {
>> - if (unlikely(vma->vm_flags & VM_LOCKED) &&
>> - (compound || !folio_test_large(folio)))
>> + /*
>> + * To handle the case that a mlocked large folio is unmapped from VMA
>> + * piece by piece, allow munlock the large folio which is partially
>> + * mapped to VMA.
>> + */
>> + if (unlikely(vma->vm_flags & VM_LOCKED))
>> munlock_folio(folio);
>> }
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2668f5ea35342..455f415d8d9ca 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -803,6 +803,14 @@ struct folio_referenced_arg {
>> unsigned long vm_flags;
>> struct mem_cgroup *memcg;
>> };
>> +
>> +static inline bool should_restore_mlock(struct folio *folio,
>> + struct vm_area_struct *vma, bool pmd_mapped)
>> +{
>> + return !folio_test_large(folio) ||
>> + pmd_mapped || folio_within_vma(folio, vma);
>> +}
>
> This is just folio_within_vma() :)
>
>> /*
>> * arg: folio_referenced_arg will be passed
>> */
>> @@ -816,13 +824,25 @@ static bool folio_referenced_one(struct folio *folio,
>> while (page_vma_mapped_walk(&pvmw)) {
>> address = pvmw.address;
>>
>> - if ((vma->vm_flags & VM_LOCKED) &&
>> - (!folio_test_large(folio) || !pvmw.pte)) {
>> - /* Restore the mlock which got missed */
>> - mlock_vma_folio(folio, vma, !pvmw.pte);
>> - page_vma_mapped_walk_done(&pvmw);
>> - pra->vm_flags |= VM_LOCKED;
>> - return false; /* To break the loop */
>> + if (vma->vm_flags & VM_LOCKED) {
>> + if (should_restore_mlock(folio, vma, !pvmw.pte)) {
>> + /* Restore the mlock which got missed */
>> + mlock_vma_folio(folio, vma, !pvmw.pte);
>> + page_vma_mapped_walk_done(&pvmw);
>> + pra->vm_flags |= VM_LOCKED;
>> + return false; /* To break the loop */
>> + } else {
>
> There is no need for "else", or just
>
> if (!folio_within_vma())
> goto dec_pra_mapcount;
I tried not to use goto as much as possible. I suppose you mean:

if (!should_restore_lock())
goto dec_pra_mapcount; (I may use continue here. :)).

mlock_vma_folio();
page_vma_mapped_walk_done()
...

Right?


Regards
Yin, Fengwei

>
>> + /*
>> + * For large folio cross VMA boundaries, it's
>> + * expected to be picked by page reclaim. But
>> + * should skip reference of pages which are in
>> + * the range of VM_LOCKED vma. As page reclaim
>> + * should just count the reference of pages out
>> + * the range of VM_LOCKED vma.
>> + */
>> + pra->mapcount--;
>> + continue;
>> + }
>> }

2023-07-12 17:45:34

by Yu Zhao

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range

On Wed, Jul 12, 2023 at 12:44 AM Yin Fengwei <[email protected]> wrote:
>
>
> On 7/12/23 14:23, Yu Zhao wrote:
> > On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
> >>
> >> If large folio is in the range of VM_LOCKED VMA, it should be
> >> mlocked to avoid being picked by page reclaim. Which may split
> >> the large folio and then mlock each pages again.
> >>
> >> Mlock this kind of large folio to prevent them being picked by
> >> page reclaim.
> >>
> >> For the large folio which cross the boundary of VM_LOCKED VMA,
> >> we'd better not to mlock it. So if the system is under memory
> >> pressure, this kind of large folio will be split and the pages
> >> ouf of VM_LOCKED VMA can be reclaimed.
> >>
> >> Signed-off-by: Yin Fengwei <[email protected]>
> >> ---
> >> mm/internal.h | 11 ++++++++---
> >> mm/rmap.c | 34 +++++++++++++++++++++++++++-------
> >> 2 files changed, 35 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index c7dd15d8de3ef..776141de2797a 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
> >> * still be set while VM_SPECIAL bits are added: so ignore it then.
> >> */
> >> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> >> - (compound || !folio_test_large(folio)))
> >> + (compound || !folio_test_large(folio) ||
> >> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
> >> mlock_folio(folio);
> >> }
> >
> > This can be simplified:
> > 1. remove the compound parameter
> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
> if the only condition need check is whether the folio is within VMA range or not.
>
> But let me add Huge for confirmation.
>
>
> > 2. make the if
> > if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> > folio_within_vma())
> > mlock_folio(folio);
> !folio_test_large(folio) was kept here by purpose. For normal 4K page, don't need
> to call folio_within_vma() which is heavy for normal 4K page.

I suspected you would think so -- I don't think it would make any
measurable difference (for systems with mostly large folios, it would
actually be extra work). Since we have many places like this one,
probably we could wrap folio_test_large() into folio_within_vma() and
call it large_folio_within_vma(), if you feel it's necessary.

> >> @@ -651,8 +652,12 @@ void munlock_folio(struct folio *folio);
> >> static inline void munlock_vma_folio(struct folio *folio,
> >> struct vm_area_struct *vma, bool compound)
> >
> > Remove the compound parameter here too.
> >
> >> {
> >> - if (unlikely(vma->vm_flags & VM_LOCKED) &&
> >> - (compound || !folio_test_large(folio)))
> >> + /*
> >> + * To handle the case that a mlocked large folio is unmapped from VMA
> >> + * piece by piece, allow munlock the large folio which is partially
> >> + * mapped to VMA.
> >> + */
> >> + if (unlikely(vma->vm_flags & VM_LOCKED))
> >> munlock_folio(folio);
> >> }
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 2668f5ea35342..455f415d8d9ca 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -803,6 +803,14 @@ struct folio_referenced_arg {
> >> unsigned long vm_flags;
> >> struct mem_cgroup *memcg;
> >> };
> >> +
> >> +static inline bool should_restore_mlock(struct folio *folio,
> >> + struct vm_area_struct *vma, bool pmd_mapped)
> >> +{
> >> + return !folio_test_large(folio) ||
> >> + pmd_mapped || folio_within_vma(folio, vma);
> >> +}
> >
> > This is just folio_within_vma() :)
> >
> >> /*
> >> * arg: folio_referenced_arg will be passed
> >> */
> >> @@ -816,13 +824,25 @@ static bool folio_referenced_one(struct folio *folio,
> >> while (page_vma_mapped_walk(&pvmw)) {
> >> address = pvmw.address;
> >>
> >> - if ((vma->vm_flags & VM_LOCKED) &&
> >> - (!folio_test_large(folio) || !pvmw.pte)) {
> >> - /* Restore the mlock which got missed */
> >> - mlock_vma_folio(folio, vma, !pvmw.pte);
> >> - page_vma_mapped_walk_done(&pvmw);
> >> - pra->vm_flags |= VM_LOCKED;
> >> - return false; /* To break the loop */
> >> + if (vma->vm_flags & VM_LOCKED) {
> >> + if (should_restore_mlock(folio, vma, !pvmw.pte)) {
> >> + /* Restore the mlock which got missed */
> >> + mlock_vma_folio(folio, vma, !pvmw.pte);
> >> + page_vma_mapped_walk_done(&pvmw);
> >> + pra->vm_flags |= VM_LOCKED;
> >> + return false; /* To break the loop */
> >> + } else {
> >
> > There is no need for "else", or just
> >
> > if (!folio_within_vma())
> > goto dec_pra_mapcount;
> I tried not to use goto as much as possible. I suppose you mean:
>
> if (!should_restore_lock())
> goto dec_pra_mapcount; (I may use continue here. :)).

should_restore_lock() is just folio_within_vma() -- see the comment
above. "continue" looks good to me too (prefer not to add more indents
to the functions below).

> mlock_vma_folio();
> page_vma_mapped_walk_done()
> ...
>
> Right?

Right.

2023-07-13 02:15:52

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range



On 7/13/23 01:03, Yu Zhao wrote:
> On Wed, Jul 12, 2023 at 12:44 AM Yin Fengwei <[email protected]> wrote:
>>
>>
>> On 7/12/23 14:23, Yu Zhao wrote:
>>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>>>
>>>> If large folio is in the range of VM_LOCKED VMA, it should be
>>>> mlocked to avoid being picked by page reclaim. Which may split
>>>> the large folio and then mlock each pages again.
>>>>
>>>> Mlock this kind of large folio to prevent them being picked by
>>>> page reclaim.
>>>>
>>>> For the large folio which cross the boundary of VM_LOCKED VMA,
>>>> we'd better not to mlock it. So if the system is under memory
>>>> pressure, this kind of large folio will be split and the pages
>>>> ouf of VM_LOCKED VMA can be reclaimed.
>>>>
>>>> Signed-off-by: Yin Fengwei <[email protected]>
>>>> ---
>>>> mm/internal.h | 11 ++++++++---
>>>> mm/rmap.c | 34 +++++++++++++++++++++++++++-------
>>>> 2 files changed, 35 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index c7dd15d8de3ef..776141de2797a 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
>>>> * still be set while VM_SPECIAL bits are added: so ignore it then.
>>>> */
>>>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
>>>> - (compound || !folio_test_large(folio)))
>>>> + (compound || !folio_test_large(folio) ||
>>>> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
>>>> mlock_folio(folio);
>>>> }
>>>
>>> This can be simplified:
>>> 1. remove the compound parameter
>> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
>> if the only condition need check is whether the folio is within VMA range or not.
>>
>> But let me add Huge for confirmation.
>>
>>
>>> 2. make the if
>>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
>>> folio_within_vma())
>>> mlock_folio(folio);
>> !folio_test_large(folio) was kept here by purpose. For normal 4K page, don't need
>> to call folio_within_vma() which is heavy for normal 4K page.
>
> I suspected you would think so -- I don't think it would make any
> measurable (for systems with mostly large folios, it would actually be
> an extra work). Since we have many places like this once, probably we
> could wrap folio_test_large() into folio_within_vma() and call it
> large_folio_within_vma(), if you feel it's necessary.
I thought about moving folio_test_large() into folio_in_range(). But I gave
it up because of the check of the folio address against the VMA range.

But with the new folio_within_vma(), we could do that. I will move
folio_test_large() into folio_within_vma() (keeping the current naming) and
make it like:

return !folio_test_large() || folio_in_range();

>
>>>> @@ -651,8 +652,12 @@ void munlock_folio(struct folio *folio);
>>>> static inline void munlock_vma_folio(struct folio *folio,
>>>> struct vm_area_struct *vma, bool compound)
>>>
>>> Remove the compound parameter here too.
>>>
>>>> {
>>>> - if (unlikely(vma->vm_flags & VM_LOCKED) &&
>>>> - (compound || !folio_test_large(folio)))
>>>> + /*
>>>> + * To handle the case that a mlocked large folio is unmapped from VMA
>>>> + * piece by piece, allow munlock the large folio which is partially
>>>> + * mapped to VMA.
>>>> + */
>>>> + if (unlikely(vma->vm_flags & VM_LOCKED))
>>>> munlock_folio(folio);
>>>> }
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 2668f5ea35342..455f415d8d9ca 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -803,6 +803,14 @@ struct folio_referenced_arg {
>>>> unsigned long vm_flags;
>>>> struct mem_cgroup *memcg;
>>>> };
>>>> +
>>>> +static inline bool should_restore_mlock(struct folio *folio,
>>>> + struct vm_area_struct *vma, bool pmd_mapped)
>>>> +{
>>>> + return !folio_test_large(folio) ||
>>>> + pmd_mapped || folio_within_vma(folio, vma);
>>>> +}
>>>
>>> This is just folio_within_vma() :)
>>>
>>>> /*
>>>> * arg: folio_referenced_arg will be passed
>>>> */
>>>> @@ -816,13 +824,25 @@ static bool folio_referenced_one(struct folio *folio,
>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>> address = pvmw.address;
>>>>
>>>> - if ((vma->vm_flags & VM_LOCKED) &&
>>>> - (!folio_test_large(folio) || !pvmw.pte)) {
>>>> - /* Restore the mlock which got missed */
>>>> - mlock_vma_folio(folio, vma, !pvmw.pte);
>>>> - page_vma_mapped_walk_done(&pvmw);
>>>> - pra->vm_flags |= VM_LOCKED;
>>>> - return false; /* To break the loop */
>>>> + if (vma->vm_flags & VM_LOCKED) {
>>>> + if (should_restore_mlock(folio, vma, !pvmw.pte)) {
>>>> + /* Restore the mlock which got missed */
>>>> + mlock_vma_folio(folio, vma, !pvmw.pte);
>>>> + page_vma_mapped_walk_done(&pvmw);
>>>> + pra->vm_flags |= VM_LOCKED;
>>>> + return false; /* To break the loop */
>>>> + } else {
>>>
>>> There is no need for "else", or just
>>>
>>> if (!folio_within_vma())
>>> goto dec_pra_mapcount;
>> I tried not to use goto as much as possible. I suppose you mean:
>>
>> if (!should_restore_lock())
>> goto dec_pra_mapcount; (I may use continue here. :)).
>
> should_restore_lock() is just folio_within_vma() -- see the comment
> above. "continue" looks good to me too (prefer not to add more indents
> to the functions below).
Yes.

>
>> mlock_vma_folio();
>> page_vma_mapped_walk_done()
>> ...
>>
>> Right?
>
> Right.
This is a very good suggestion. Will update v3 accordingly after waiting
for a while in case there are other comments. Thanks.


Regards
Yin, Fengwei


2023-07-14 02:33:38

by Hugh Dickins

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range

On Wed, 12 Jul 2023, Yin Fengwei wrote:
> On 7/12/23 14:23, Yu Zhao wrote:
> > On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
> >> * still be set while VM_SPECIAL bits are added: so ignore it then.
> >> */
> >> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> >> - (compound || !folio_test_large(folio)))
> >> + (compound || !folio_test_large(folio) ||
> >> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
> >> mlock_folio(folio);
> >> }
> >
> > This can be simplified:
> > 1. remove the compound parameter
> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
> if the only condition need check is whether the folio is within VMA range or not.
>
> But let me add Huge for confirmation.

I'm not sure what it is that you need me to confirm: if the folio fits
within the vma, then the folio fits within the vma, pmd-mapped or not.

(And I agree with Yu that it's better to drop the folio_test_large()
check too.)

This idea, of counting the folio as mlocked according to whether the
whole folio fits within the vma, does seem a good idea to me: worth
pursuing. But whether the implementation adds up and works out, I
have not checked. It was always difficult to arrive at a satisfactory
compromise in mlocking compound pages: I hope this way does work out.

Hugh

2023-07-14 03:11:14

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range



On 7/14/2023 10:21 AM, Hugh Dickins wrote:
> On Wed, 12 Jul 2023, Yin Fengwei wrote:
>> On 7/12/23 14:23, Yu Zhao wrote:
>>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>>> --- a/
>>>> +++ b/mm/internal.h
>>>> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
>>>> * still be set while VM_SPECIAL bits are added: so ignore it then.
>>>> */
>>>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
>>>> - (compound || !folio_test_large(folio)))
>>>> + (compound || !folio_test_large(folio) ||
>>>> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
>>>> mlock_folio(folio);
>>>> }
>>>
>>> This can be simplified:
>>> 1. remove the compound parameter
>> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
>> if the only condition need check is whether the folio is within VMA range or not.
>>
>> But let me add Huge for confirmation.
>
> I'm not sure what it is that you need me to confirm: if the folio fits
> within the vma, then the folio fits within the vma, pmd-mapped or not.
Sorry. My bad. I should have spelled out what I want your confirmation on:
whether we can remove the compound parameter and instead use whether the
folio is within the VMA.

I suppose your answer is yes.

>
> (And I agree with Yu that it's better to drop the folio_test_large()
> check too.)
My argument was that folio_test_large() can filter normal 4K pages out so
that they don't need to call folio_in_range(), which looks a little bit
heavy to me for a normal 4K page. And the deal was to move folio_test_large()
into a folio_in_range()-like function to simplify the code on the caller side.

>
> This idea, of counting the folio as mlocked according to whether the
> whole folio fits within the vma, does seem a good idea to me: worth
> pursuing. But whether the implementation adds up and works out, I
> have not checked. It was always difficult to arrive at a satisfactory
> compromise in mlocking compound pages: I hope this way does work out.
This is the purpose of this patch. :). Thanks.


Regards
Yin, Fengwei

>
> Hugh

2023-07-14 03:53:37

by Hugh Dickins

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range

On Fri, 14 Jul 2023, Yin, Fengwei wrote:
> On 7/14/2023 10:21 AM, Hugh Dickins wrote:
> > On Wed, 12 Jul 2023, Yin Fengwei wrote:
> >> On 7/12/23 14:23, Yu Zhao wrote:
> >>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
> >>>> --- a/
> >>>> +++ b/mm/internal.h
> >>>> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
> >>>> * still be set while VM_SPECIAL bits are added: so ignore it then.
> >>>> */
> >>>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
> >>>> - (compound || !folio_test_large(folio)))
> >>>> + (compound || !folio_test_large(folio) ||
> >>>> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
> >>>> mlock_folio(folio);
> >>>> }
> >>>
> >>> This can be simplified:
> >>> 1. remove the compound parameter
> >> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
> >> if the only condition need check is whether the folio is within VMA range or not.
> >>
> >> But let me add Huge for confirmation.
> >
> > I'm not sure what it is that you need me to confirm: if the folio fits
> > within the vma, then the folio fits within the vma, pmd-mapped or not.
> Sorry. My bad. I should speak it out for what I want your confirmation:
> Whether we can remove the compound and use whether folio is within
> VMA instead.
>
> I suppose you answer is Yes.

Yes (if it all works out going that way).

>
> >
> > (And I agree with Yu that it's better to drop the folio_test_large()
> > check too.)
> My argument was folio_test_large() can filter the normal 4K page out so
> it doesn't need to call folio_in_range() which looks to me a little bit
> heavy for normal 4K page. And the deal was move folio_test_large()
> to folio_in_range() like function so simplify the code in caller side.

I realized that, but agree with Yu.

It looked a little over-engineered to me, but I didn't spend long enough
looking to understand why there's folio_within_vma() distinct from
folio_in_range(), when everyone(?) calls folio_in_range() with the same
arguments vma->vm_start, vma->vm_end.

>
> >
> > This idea, of counting the folio as mlocked according to whether the
> > whole folio fits within the vma, does seem a good idea to me: worth
> > pursuing. But whether the implementation adds up and works out, I
> > have not checked. It was always difficult to arrive at a satisfactory
> > compromise in mlocking compound pages: I hope this way does work out.
> This is the purpose of this patch. :). Thanks.
>
>
> Regards
> Yin, Fengwei
>
> >
> > Hugh

2023-07-14 05:59:01

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range



On 7/14/2023 11:41 AM, Hugh Dickins wrote:
> On Fri, 14 Jul 2023, Yin, Fengwei wrote:
>> On 7/14/2023 10:21 AM, Hugh Dickins wrote:
>>> On Wed, 12 Jul 2023, Yin Fengwei wrote:
>>>> On 7/12/23 14:23, Yu Zhao wrote:
>>>>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>>>>> --- a/
>>>>>> +++ b/mm/internal.h
>>>>>> @@ -643,7 +643,8 @@ static inline void mlock_vma_folio(struct folio *folio,
>>>>>> * still be set while VM_SPECIAL bits are added: so ignore it then.
>>>>>> */
>>>>>> if (unlikely((vma->vm_flags & (VM_LOCKED|VM_SPECIAL)) == VM_LOCKED) &&
>>>>>> - (compound || !folio_test_large(folio)))
>>>>>> + (compound || !folio_test_large(folio) ||
>>>>>> + folio_in_range(folio, vma, vma->vm_start, vma->vm_end)))
>>>>>> mlock_folio(folio);
>>>>>> }
>>>>>
>>>>> This can be simplified:
>>>>> 1. remove the compound parameter
>>>> Yes. There is not difference here for pmd mapping of THPs and pte mappings of THPs
>>>> if the only condition need check is whether the folio is within VMA range or not.
>>>>
>>>> But let me add Huge for confirmation.
>>>
>>> I'm not sure what it is that you need me to confirm: if the folio fits
>>> within the vma, then the folio fits within the vma, pmd-mapped or not.
>> Sorry. My bad. I should speak it out for what I want your confirmation:
>> Whether we can remove the compound and use whether folio is within
>> VMA instead.
>>
>> I suppose you answer is Yes.
>
> Yes (if it all works out going that way).
>
>>
>>>
>>> (And I agree with Yu that it's better to drop the folio_test_large()
>>> check too.)
>> My argument was folio_test_large() can filter the normal 4K page out so
>> it doesn't need to call folio_in_range() which looks to me a little bit
>> heavy for normal 4K page. And the deal was move folio_test_large()
>> to folio_in_range() like function so simplify the code in caller side.
>
> I realized that, but agree with Yu.
OK. I will rethink this as both Yu and you suggested the same thing.

>
> It looked a little over-engineered to me, but I didn't spend long enough
> looking to understand why there's folio_within_vma() distinct from
> folio_in_range(), when everyone(?) calls folio_in_range() with the same
> arguments vma->vm_start, vma->vm_end.
madvise could call folio_in_range() with a start/end from user space instead
of using the VMA range.


Regards
Yin, Fengwei

>
>>
>>>
>>> This idea, of counting the folio as mlocked according to whether the
>>> whole folio fits within the vma, does seem a good idea to me: worth
>>> pursuing. But whether the implementation adds up and works out, I
>>> have not checked. It was always difficult to arrive at a satisfactory
>>> compromise in mlocking compound pages: I hope this way does work out.
>> This is the purpose of this patch. :). Thanks.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> Hugh

2023-07-15 07:57:12

by Yu Zhao

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Wed, Jul 12, 2023 at 12:31 AM Yu Zhao <[email protected]> wrote:
>
> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
> >
> > Current kernel only lock base size folio during mlock syscall.
> > Add large folio support with following rules:
> > - Only mlock large folio when it's in VM_LOCKED VMA range
> >
> > - If there is cow folio, mlock the cow folio as cow folio
> > is also in VM_LOCKED VMA range.
> >
> > - munlock will apply to the large folio which is in VMA range
> > or cross the VMA boundary.
> >
> > The last rule is used to handle the case that the large folio is
> > mlocked, later the VMA is split in the middle of large folio
> > and this large folio become cross VMA boundary.
> >
> > Signed-off-by: Yin Fengwei <[email protected]>
> > ---
> > mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 99 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/mlock.c b/mm/mlock.c
> > index 0a0c996c5c214..f49e079066870 100644
> > --- a/mm/mlock.c
> > +++ b/mm/mlock.c
> > @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
> > local_unlock(&mlock_fbatch.lock);
> > }
> >
> > +static inline bool should_mlock_folio(struct folio *folio,
> > + struct vm_area_struct *vma)
> > +{
> > + if (vma->vm_flags & VM_LOCKED)
> > + return (!folio_test_large(folio) ||
> > + folio_within_vma(folio, vma));
> > +
> > + /*
> > + * For unlock, allow munlock large folio which is partially
> > + * mapped to VMA. As it's possible that large folio is
> > + * mlocked and VMA is split later.
> > + *
> > + * During memory pressure, such kind of large folio can
> > + * be split. And the pages are not in VM_LOCKed VMA
> > + * can be reclaimed.
> > + */
> > +
> > + return true;
>
> Looks good, or just
>
> should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
> {
> return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
> }
>
> > +}
> > +
> > +static inline unsigned int get_folio_mlock_step(struct folio *folio,
> > + pte_t pte, unsigned long addr, unsigned long end)
> > +{
> > + unsigned int nr;
> > +
> > + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
> > + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
> > +}
> > +
> > +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> > + pte_t *pte, unsigned long addr, unsigned int nr)
> > +{
> > + struct folio *cow_folio;
> > + unsigned int step = 1;
> > +
> > + mlock_folio(folio);
> > + if (nr == 1)
> > + return;
> > +
> > + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> > + pte_t ptent;
> > +
> > + step = 1;
> > + ptent = ptep_get(pte);
> > +
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + cow_folio = vm_normal_folio(vma, addr, ptent);
> > + if (!cow_folio || cow_folio == folio) {
> > + continue;
> > + }
> > +
> > + mlock_folio(cow_folio);
> > + step = get_folio_mlock_step(folio, ptent,
> > + addr, addr + (nr << PAGE_SHIFT));
> > + }
> > +}
> > +
> > +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> > + pte_t *pte, unsigned long addr, unsigned int nr)
> > +{
> > + struct folio *cow_folio;
> > + unsigned int step = 1;
> > +
> > + munlock_folio(folio);
> > + if (nr == 1)
> > + return;
> > +
> > + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> > + pte_t ptent;
> > +
> > + step = 1;
> > + ptent = ptep_get(pte);
> > +
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + cow_folio = vm_normal_folio(vma, addr, ptent);
> > + if (!cow_folio || cow_folio == folio) {
> > + continue;
> > + }
> > +
> > + munlock_folio(cow_folio);
> > + step = get_folio_mlock_step(folio, ptent,
> > + addr, addr + (nr << PAGE_SHIFT));
> > + }
> > +}
>
> I'll finish the above later.

There is a problem here that I didn't have time to elaborate on: we
can't mlock() a folio that is within the range but not fully mapped
because this folio can be on the deferred split queue. When the split
happens, those unmapped folios (not mapped by this vma but are mapped
into other vmas) will be stranded on the unevictable lru.

For that matter, we can't mlock any large folios that are being
shared, unless you want to overengineer it by checking whether all
sharing vmas are also mlocked -- mlock is cleared during fork. So the
condition for mlocking large anon folios is 1) within range 2) fully
mapped 3) not shared (mapcount is 1). The final patch should look
something like this:

- if (folio_test_large(folio))
+ if (folio_pfn(folio) != pte_pfn(ptent))
+ continue;
+ if (!the_aforementioned_condition())

There is another corner case I forgot to mention: for example, what if
a folio spans two (the only two) adjacent mlocked vmas? No need to
worry about this since it's not worth optimizing.
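
For illustration only, a rough sketch of what such a condition might look
like (the helper name is a placeholder, not code from this series, and the
mapcount test is only an approximation of 2) and 3) for a PTE-mapped anon
folio):

	static bool can_mlock_large_anon_folio(struct folio *folio,
					       struct vm_area_struct *vma)
	{
		/* 1) within the VM_LOCKED VMA range */
		if (!folio_within_vma(folio, vma))
			return false;
		/* 2) fully mapped and 3) not shared: every page mapped once */
		return folio_mapcount(folio) == folio_nr_pages(folio);
	}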

2023-07-17 00:21:20

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/15/2023 2:06 PM, Yu Zhao wrote:
> There is a problem here that I didn't have the time to elaborate: we
> can't mlock() a folio that is within the range but not fully mapped
> because this folio can be on the deferred split queue. When the split
> happens, those unmapped folios (not mapped by this vma but are mapped
> into other vmas) will be stranded on the unevictable lru.
This should be fine unless I missed something. During a large folio split,
unmap_folio() will migrate (anon) / unmap (file) the folio. The folio will be
munlocked in unmap_folio(), so the head/tail pages will always be evictable.

>
> For that matter, we can't mlock any large folios that are being
> shared, unless you want to overengineer it by checking whether all
> sharing vmas are also mlocked -- mlock is cleared during fork. So the
> condition for mlocking large anon folios is 1) within range 2) fully
> mapped 3) not shared (mapcount is 1). The final patch should look like
> something like this:
>
> - if (folio_test_large(folio))
> + if (folio_pfn(folio) != pte_pfn(ptent))
> + continue;
> + if (!the_aforementioned_condition())
>
> There is another corner case I forgot to mention: for example, what if
> a folio spans two (the only two) adjacent mlocked vmas? No need to
> worry about this since it's not worth optimizing.
Yes. The behavior will depend on whether the folio is mlocked or not.
But the worst case is that the folio is split and each page is mlocked
again during the next scan.


Regards
Yin, Fengwei

2023-07-17 01:07:42

by Yu Zhao

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>
> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> > There is a problem here that I didn't have the time to elaborate: we
> > can't mlock() a folio that is within the range but not fully mapped
> > because this folio can be on the deferred split queue. When the split
> > happens, those unmapped folios (not mapped by this vma but are mapped
> > into other vmas) will be stranded on the unevictable lru.
>
> This should be fine unless I missed something. During large folio split,
> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> munlocked in unmap_folio(). So the head/tail pages will be evictable always.

It's close but not entirely accurate: munlock can fail on isolated folios.

2023-07-17 02:06:17

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/17/23 08:35, Yu Zhao wrote:
> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>
>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>> There is a problem here that I didn't have the time to elaborate: we
>>> can't mlock() a folio that is within the range but not fully mapped
>>> because this folio can be on the deferred split queue. When the split
>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>> into other vmas) will be stranded on the unevictable lru.
>>
>> This should be fine unless I missed something. During large folio split,
>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>
> It's close but not entirely accurate: munlock can fail on isolated folios.
Yes. The munlock just clears the PG_mlocked bit but leaves PG_unevictable set.

Could this also happen to a normal 4K page? I mean, when the user tries to
munlock a normal 4K page and this 4K page is isolated, does it become an
unevictable page?


Regards
Yin, Fengwei

2023-07-17 08:42:23

by Yin, Fengwei

Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/17/23 08:35, Yu Zhao wrote:
> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>
>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>> There is a problem here that I didn't have the time to elaborate: we
>>> can't mlock() a folio that is within the range but not fully mapped
>>> because this folio can be on the deferred split queue. When the split
>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>> into other vmas) will be stranded on the unevictable lru.
>>
>> This should be fine unless I missed something. During large folio split,
>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>
> It's close but not entirely accurate: munlock can fail on isolated folios.

I suppose a normal 4K page can hit this problem too, and the following patch
could fix it:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568bb..839b8398aa613 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2498,7 +2498,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,

                 VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
                 list_del(&folio->lru);
-                if (unlikely(!folio_evictable(folio))) {
+                if (unlikely(!folio_evictable(folio) || folio_test_unevictable(folio))) {
                         spin_unlock_irq(&lruvec->lru_lock);
                         folio_putback_lru(folio);
                         spin_lock_irq(&lruvec->lru_lock);
@@ -2723,7 +2723,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
                 folio = lru_to_folio(&l_hold);
                 list_del(&folio->lru);

-                if (unlikely(!folio_evictable(folio))) {
+                if (unlikely(!folio_evictable(folio) || folio_test_unevictable(folio))) {
                         folio_putback_lru(folio);
                         continue;
                 }
@@ -5182,7 +5182,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
         sc->nr_reclaimed += reclaimed;

         list_for_each_entry_safe_reverse(folio, next, &list, lru) {
-                if (!folio_evictable(folio)) {
+                if (!folio_evictable(folio) || folio_test_unevictable(folio)) {
                         list_del(&folio->lru);
                         folio_putback_lru(folio);
                         continue;



Regards
Yin, Fengwei


2023-07-18 02:46:18

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/17/23 16:12, Yin Fengwei wrote:
>
>
> On 7/17/23 08:35, Yu Zhao wrote:
>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>>
>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>>> There is a problem here that I didn't have the time to elaborate: we
>>>> can't mlock() a folio that is within the range but not fully mapped
>>>> because this folio can be on the deferred split queue. When the split
>>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>>> into other vmas) will be stranded on the unevictable lru.
>>>
>>> This should be fine unless I missed something. During large folio split,
>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>>
>> It's close but not entirely accurate: munlock can fail on isolated folios.
>
> I suppose normal 4K page can hit this problem also and following patch could
> fix it:
No. This patch is not necessary, as an unevictable folio will not be picked up by
page reclaim. It's not possible to munlock an isolated folio from the lru list.

The possible cases I am aware of are: page_migrate, madvise, damon_pa_pageout and
lru_gen_look_around. The first three already handle this case correctly by calling
folio_putback_lru().

If the folio is isolated, split_folio() will just fail. So it looks like we are fine
for this corner case. Let me know if I missed something here.


Regards
Yin, Fengwei

>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1080209a568bb..839b8398aa613 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2498,7 +2498,7 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
>
> VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> list_del(&folio->lru);
> - if (unlikely(!folio_evictable(folio))) {
> + if (unlikely(!folio_evictable(folio) || folio_test_unevictable(folio))) {
> spin_unlock_irq(&lruvec->lru_lock);
> folio_putback_lru(folio);
> spin_lock_irq(&lruvec->lru_lock);
> @@ -2723,7 +2723,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
> folio = lru_to_folio(&l_hold);
> list_del(&folio->lru);
>
> - if (unlikely(!folio_evictable(folio))) {
> + if (unlikely(!folio_evictable(folio) || folio_test_unevictable(folio))) {
> folio_putback_lru(folio);
> continue;
> }
> @@ -5182,7 +5182,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
> sc->nr_reclaimed += reclaimed;
>
> list_for_each_entry_safe_reverse(folio, next, &list, lru) {
> - if (!folio_evictable(folio)) {
> + if (!folio_evictable(folio) || folio_test_unevictable(folio)) {
> list_del(&folio->lru);
> folio_putback_lru(folio);
> continue;
>
>
>
> Regards
> Yin, Fengwei
>

2023-07-18 04:51:35

by Yu Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Mon, Jul 17, 2023 at 8:07 PM Yin Fengwei <[email protected]> wrote:
>
> On 7/17/23 16:12, Yin Fengwei wrote:
> >
> > On 7/17/23 08:35, Yu Zhao wrote:
> >> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> >>>
> >>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> >>>> There is a problem here that I didn't have the time to elaborate: we
> >>>> can't mlock() a folio that is within the range but not fully mapped
> >>>> because this folio can be on the deferred split queue. When the split
> >>>> happens, those unmapped folios (not mapped by this vma but are mapped
> >>>> into other vmas) will be stranded on the unevictable lru.
> >>>
> >>> This should be fine unless I missed something. During large folio split,
> >>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> >>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> >>
> >> It's close but not entirely accurate: munlock can fail on isolated folios.
> >
> > I suppose normal 4K page can hit this problem also and following patch could
> > fix it:
> No. This patch is not necessary as unevictable folio will not be picked up by
> page reclaim. It's not possible to munlock the isolated folio from lru list.
>
> The possible cases I am ware are: page_migrate, madvise and damon_pa_pageout and
> lru_gen_look_around. The first three already handle this case correctly by call
> folio_putback_lru().
>
> If folio is isolated, the split_folio() will just fail. So looks we are fine
> for this corner case. Let me know if I miss something here.

The race is between isolation and munlock -- split_folio() only fails
if a folio is still isolated when it tries to freeze its refcnt, e.g.,
cpu 1                          cpu 2
                               split_folio()
  isolation                      unmap_folio()
  putback
                                 freeze refcnt

2023-07-18 23:46:12

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
>
>
>
> On 7/17/23 08:35, Yu Zhao wrote:
> > On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> >>
> >> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> >>> There is a problem here that I didn't have the time to elaborate: we
> >>> can't mlock() a folio that is within the range but not fully mapped
> >>> because this folio can be on the deferred split queue. When the split
> >>> happens, those unmapped folios (not mapped by this vma but are mapped
> >>> into other vmas) will be stranded on the unevictable lru.
> >>
> >> This should be fine unless I missed something. During large folio split,
> >> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> >> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> >
> > It's close but not entirely accurate: munlock can fail on isolated folios.
> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
>
> Could this also happen against normal 4K page? I mean when user try to munlock
> a normal 4K page and this 4K page is isolated. So it become unevictable page?

Looks like it can be possible. If cpu 1 is in __munlock_folio() and
cpu 2 is isolating the folio for any purpose:

cpu1                               cpu2
                                   isolate folio
folio_test_clear_lru() // 0
                                   putback folio // add to unevictable list
folio_test_clear_mlocked()


The page would be stranded on the unevictable list in this case, no?
Maybe we should only try to isolate the page (clear PG_lru) after we
possibly clear PG_mlocked? In this case if we fail to isolate we know
for sure that whoever has the page isolated will observe that
PG_mlocked is clear and correctly make the page evictable.

This probably would be complicated with the current implementation, as
we first need to decrement mlock_count to determine if we want to
clear PG_mlocked, and to do so we need to isolate the page as
mlock_count overlays page->lru. With the proposal in [1] to rework
mlock_count, it might be much simpler as far as I can tell. I intend
to refresh this proposal soon-ish.

[1]https://lore.kernel.org/lkml/[email protected]/
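
For readers following along, the overlay mentioned above looks roughly like this
(paraphrased from the struct folio definition of this era; illustrative, not a
verbatim copy):

/*
 * Paraphrased, illustrative layout: for folios on the unevictable
 * "list", the lru list_head slot is reused to hold mlock_count, so a
 * folio cannot sit on an LRU list and carry an mlock_count at the
 * same time.
 */
struct folio_lru_slot_sketch {
        union {
                struct list_head lru;           /* when on a real LRU list */
                struct {
                        void *__filler;         /* always even, to negate PageTail */
                        unsigned int mlock_count;  /* when on the unevictable "list" */
                };
        };
};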

>
>
> Regards
> Yin, Fengwei
>

2023-07-19 00:05:06

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/19/23 06:48, Yosry Ahmed wrote:
> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/17/23 08:35, Yu Zhao wrote:
>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>>>
>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>>>> There is a problem here that I didn't have the time to elaborate: we
>>>>> can't mlock() a folio that is within the range but not fully mapped
>>>>> because this folio can be on the deferred split queue. When the split
>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>>>> into other vmas) will be stranded on the unevictable lru.
>>>>
>>>> This should be fine unless I missed something. During large folio split,
>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>>>
>>> It's close but not entirely accurate: munlock can fail on isolated folios.
>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
>>
>> Could this also happen against normal 4K page? I mean when user try to munlock
>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>
> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> cpu 2 is isolating the folio for any purpose:
>
> cpu1 cpu2
> isolate folio
> folio_test_clear_lru() // 0
> putback folio // add
> to unevictable list
> folio_test_clear_mlocked()
Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
could correct the non-mlocked but unevictable folio, but it doesn't because
of this race.

>
>
> The page would be stranded on the unevictable list in this case, no?
> Maybe we should only try to isolate the page (clear PG_lru) after we
> possibly clear PG_mlocked? In this case if we fail to isolate we know
> for sure that whoever has the page isolated will observe that
> PG_mlocked is clear and correctly make the page evictable.
>
> This probably would be complicated with the current implementation, as
> we first need to decrement mlock_count to determine if we want to
> clear PG_mlocked, and to do so we need to isolate the page as
> mlock_count overlays page->lru. With the proposal in [1] to rework
> mlock_count, it might be much simpler as far as I can tell. I intend
> to refresh this proposal soon-ish.
>
> [1]https://lore.kernel.org/lkml/[email protected]/
>
>>
>>
>> Regards
>> Yin, Fengwei
>>

2023-07-19 01:48:18

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
>
>
>
> On 7/19/23 06:48, Yosry Ahmed wrote:
> > On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
> >>
> >>
> >>
> >> On 7/17/23 08:35, Yu Zhao wrote:
> >>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> >>>>
> >>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> >>>>> There is a problem here that I didn't have the time to elaborate: we
> >>>>> can't mlock() a folio that is within the range but not fully mapped
> >>>>> because this folio can be on the deferred split queue. When the split
> >>>>> happens, those unmapped folios (not mapped by this vma but are mapped
> >>>>> into other vmas) will be stranded on the unevictable lru.
> >>>>
> >>>> This should be fine unless I missed something. During large folio split,
> >>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> >>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> >>>
> >>> It's close but not entirely accurate: munlock can fail on isolated folios.
> >> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
> >>
> >> Could this also happen against normal 4K page? I mean when user try to munlock
> >> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >
> > Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> > cpu 2 is isolating the folio for any purpose:
> >
> > cpu1 cpu2
> > isolate folio
> > folio_test_clear_lru() // 0
> > putback folio // add
> > to unevictable list
> > folio_test_clear_mlocked()
> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
> could correct the none-mlocked but unevictable folio. But it doesn't because
> of this race.

(+Hugh Dickins for vis)

Yu, I am not familiar with the split_folio() case, so I am not sure it
is the same exact race I stated above.

Can you confirm whether or not doing folio_test_clear_mlocked() before
folio_test_clear_lru() would fix the race you are referring to? IIUC,
in this case, we make sure we clear PG_mlocked before we try to
clear PG_lru. If we fail to clear PG_lru, then someone else has the folio
isolated after we cleared PG_mlocked, so we can be sure that when they
put the folio back it will be correctly made evictable.

Is my understanding correct?

If yes, I can add this fix to my next version of the RFC series to
rework mlock_count. It would be a lot more complicated with the
current implementation (as I stated in a previous email).

>
> >
> >
> > The page would be stranded on the unevictable list in this case, no?
> > Maybe we should only try to isolate the page (clear PG_lru) after we
> > possibly clear PG_mlocked? In this case if we fail to isolate we know
> > for sure that whoever has the page isolated will observe that
> > PG_mlocked is clear and correctly make the page evictable.
> >
> > This probably would be complicated with the current implementation, as
> > we first need to decrement mlock_count to determine if we want to
> > clear PG_mlocked, and to do so we need to isolate the page as
> > mlock_count overlays page->lru. With the proposal in [1] to rework
> > mlock_count, it might be much simpler as far as I can tell. I intend
> > to refresh this proposal soon-ish.
> >
> > [1]https://lore.kernel.org/lkml/[email protected]/
> >
> >>
> >>
> >> Regards
> >> Yin, Fengwei
> >>

2023-07-19 02:06:32

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
> >
> >
> >
> > On 7/19/23 06:48, Yosry Ahmed wrote:
> > > On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
> > >>
> > >>
> > >>
> > >> On 7/17/23 08:35, Yu Zhao wrote:
> > >>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> > >>>>
> > >>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> > >>>>> There is a problem here that I didn't have the time to elaborate: we
> > >>>>> can't mlock() a folio that is within the range but not fully mapped
> > >>>>> because this folio can be on the deferred split queue. When the split
> > >>>>> happens, those unmapped folios (not mapped by this vma but are mapped
> > >>>>> into other vmas) will be stranded on the unevictable lru.
> > >>>>
> > >>>> This should be fine unless I missed something. During large folio split,
> > >>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> > >>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> > >>>
> > >>> It's close but not entirely accurate: munlock can fail on isolated folios.
> > >> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
> > >>
> > >> Could this also happen against normal 4K page? I mean when user try to munlock
> > >> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> > >
> > > Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> > > cpu 2 is isolating the folio for any purpose:
> > >
> > > cpu1 cpu2
> > > isolate folio
> > > folio_test_clear_lru() // 0
> > > putback folio // add
> > > to unevictable list
> > > folio_test_clear_mlocked()
> > Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
> > could correct the none-mlocked but unevictable folio. But it doesn't because
> > of this race.
>
> (+Hugh Dickins for vis)
>
> Yu, I am not familiar with the split_folio() case, so I am not sure it
> is the same exact race I stated above.
>
> Can you confirm whether or not doing folio_test_clear_mlocked() before
> folio_test_clear_lru() would fix the race you are referring to? IIUC,
> in this case, we make sure we clear PG_mlocked before we try to to
> clear PG_lru. If we fail to clear it, then someone else have the folio
> isolated after we clear PG_mlocked, so we can be sure that when they
> put the folio back it will be correctly made evictable.
>
> Is my understanding correct?

Hmm, actually this might not be enough. In folio_add_lru() we will
call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
race can still happen:


cpu1                               cpu2
                                   folio_evictable() //false
folio_test_clear_mlocked()
folio_test_clear_lru() //false
                                   folio_set_lru()

Relying on PG_lru for synchronization might not be enough with the
current code. We might need to revert 2262ace60713 ("mm/munlock:
delete smp_mb() from __pagevec_lru_add_fn()").

Sorry for going back and forth here, I am thinking out loud.
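
For reference, a condensed sketch of the ordering in question (paraphrased from the
mm/swap.c of this era; illustrative, not verbatim):

/*
 * Condensed, paraphrased sketch: the move callback -- lru_add_fn() for
 * additions -- runs first and checks folio_evictable() (i.e. PG_mlocked),
 * and only afterwards is PG_lru set.  That gap is the window in the
 * race above.
 */
static void folio_batch_move_lru_sketch(struct folio_batch *fbatch,
                                        move_fn_t move_fn)
{
        struct lruvec *lruvec = NULL;
        unsigned long flags = 0;
        int i;

        for (i = 0; i < folio_batch_count(fbatch); i++) {
                struct folio *folio = fbatch->folios[i];

                lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
                move_fn(lruvec, folio); /* lru_add_fn(): folio_evictable() checked here */
                folio_set_lru(folio);   /* PG_lru only becomes visible afterwards */
        }
        if (lruvec)
                unlock_page_lruvec_irqrestore(lruvec, flags);
}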

>
> If yes, I can add this fix to my next version of the RFC series to
> rework mlock_count. It would be a lot more complicated with the
> current implementation (as I stated in a previous email).
>
> >
> > >
> > >
> > > The page would be stranded on the unevictable list in this case, no?
> > > Maybe we should only try to isolate the page (clear PG_lru) after we
> > > possibly clear PG_mlocked? In this case if we fail to isolate we know
> > > for sure that whoever has the page isolated will observe that
> > > PG_mlocked is clear and correctly make the page evictable.
> > >
> > > This probably would be complicated with the current implementation, as
> > > we first need to decrement mlock_count to determine if we want to
> > > clear PG_mlocked, and to do so we need to isolate the page as
> > > mlock_count overlays page->lru. With the proposal in [1] to rework
> > > mlock_count, it might be much simpler as far as I can tell. I intend
> > > to refresh this proposal soon-ish.
> > >
> > > [1]https://lore.kernel.org/lkml/[email protected]/
> > >
> > >>
> > >>
> > >> Regards
> > >> Yin, Fengwei
> > >>

2023-07-19 02:12:18

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio


On 7/19/23 09:52, Yosry Ahmed wrote:
> On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
>> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
>>>
>>>
>>> On 7/19/23 06:48, Yosry Ahmed wrote:
>>>> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 7/17/23 08:35, Yu Zhao wrote:
>>>>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>>>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>>>>>>> There is a problem here that I didn't have the time to elaborate: we
>>>>>>>> can't mlock() a folio that is within the range but not fully mapped
>>>>>>>> because this folio can be on the deferred split queue. When the split
>>>>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>>>>>>> into other vmas) will be stranded on the unevictable lru.
>>>>>>> This should be fine unless I missed something. During large folio split,
>>>>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>>>>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>>>>>> It's close but not entirely accurate: munlock can fail on isolated folios.
>>>>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
>>>>>
>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>> cpu 2 is isolating the folio for any purpose:
>>>>
>>>> cpu1 cpu2
>>>> isolate folio
>>>> folio_test_clear_lru() // 0
>>>> putback folio // add
>>>> to unevictable list
>>>> folio_test_clear_mlocked()
>>> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
>>> could correct the none-mlocked but unevictable folio. But it doesn't because
>>> of this race.
>> (+Hugh Dickins for vis)
>>
>> Yu, I am not familiar with the split_folio() case, so I am not sure it
>> is the same exact race I stated above.
>>
>> Can you confirm whether or not doing folio_test_clear_mlocked() before
>> folio_test_clear_lru() would fix the race you are referring to? IIUC,
>> in this case, we make sure we clear PG_mlocked before we try to to
>> clear PG_lru. If we fail to clear it, then someone else have the folio
>> isolated after we clear PG_mlocked, so we can be sure that when they
>> put the folio back it will be correctly made evictable.
>>
>> Is my understanding correct?
> Hmm, actually this might not be enough. In folio_add_lru() we will
> call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
> sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
> race can still happen:
>
>
> cpu1 cpu2
> folio_evictable() //false
> folio_test_clear_mlocked()
> folio_test_clear_lru() //false
> folio_set_lru()
>
> Relying on PG_lru for synchronization might not be enough with the
> current code. We might need to revert 2262ace60713 ("mm/munlock:
> delete smp_mb() from __pagevec_lru_add_fn()").
>
> Sorry for going back and forth here, I am thinking out loud.

Yes. Currently, the order in lru_add_fn() is not correct.

I think we should move folio_test_clear_lru(folio) into the lru locked range,
as the lru lock here was expected to be used for synchronization. Check the
comment in lru_add_fn().


Regards

Yin, Fengwei


>
>> If yes, I can add this fix to my next version of the RFC series to
>> rework mlock_count. It would be a lot more complicated with the
>> current implementation (as I stated in a previous email).
>>
>>>>
>>>> The page would be stranded on the unevictable list in this case, no?
>>>> Maybe we should only try to isolate the page (clear PG_lru) after we
>>>> possibly clear PG_mlocked? In this case if we fail to isolate we know
>>>> for sure that whoever has the page isolated will observe that
>>>> PG_mlocked is clear and correctly make the page evictable.
>>>>
>>>> This probably would be complicated with the current implementation, as
>>>> we first need to decrement mlock_count to determine if we want to
>>>> clear PG_mlocked, and to do so we need to isolate the page as
>>>> mlock_count overlays page->lru. With the proposal in [1] to rework
>>>> mlock_count, it might be much simpler as far as I can tell. I intend
>>>> to refresh this proposal soon-ish.
>>>>
>>>> [1]https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>>>
>>>>> Regards
>>>>> Yin, Fengwei
>>>>>

2023-07-19 02:32:59

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Tue, Jul 18, 2023 at 6:57 PM Yin Fengwei <[email protected]> wrote:
>
>
> On 7/19/23 09:52, Yosry Ahmed wrote:
> > On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
> >> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
> >>>
> >>>
> >>> On 7/19/23 06:48, Yosry Ahmed wrote:
> >>>> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>> On 7/17/23 08:35, Yu Zhao wrote:
> >>>>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> >>>>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> >>>>>>>> There is a problem here that I didn't have the time to elaborate: we
> >>>>>>>> can't mlock() a folio that is within the range but not fully mapped
> >>>>>>>> because this folio can be on the deferred split queue. When the split
> >>>>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
> >>>>>>>> into other vmas) will be stranded on the unevictable lru.
> >>>>>>> This should be fine unless I missed something. During large folio split,
> >>>>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> >>>>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> >>>>>> It's close but not entirely accurate: munlock can fail on isolated folios.
> >>>>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
> >>>>>
> >>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>> cpu 2 is isolating the folio for any purpose:
> >>>>
> >>>> cpu1 cpu2
> >>>> isolate folio
> >>>> folio_test_clear_lru() // 0
> >>>> putback folio // add
> >>>> to unevictable list
> >>>> folio_test_clear_mlocked()
> >>> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
> >>> could correct the none-mlocked but unevictable folio. But it doesn't because
> >>> of this race.
> >> (+Hugh Dickins for vis)
> >>
> >> Yu, I am not familiar with the split_folio() case, so I am not sure it
> >> is the same exact race I stated above.
> >>
> >> Can you confirm whether or not doing folio_test_clear_mlocked() before
> >> folio_test_clear_lru() would fix the race you are referring to? IIUC,
> >> in this case, we make sure we clear PG_mlocked before we try to to
> >> clear PG_lru. If we fail to clear it, then someone else have the folio
> >> isolated after we clear PG_mlocked, so we can be sure that when they
> >> put the folio back it will be correctly made evictable.
> >>
> >> Is my understanding correct?
> > Hmm, actually this might not be enough. In folio_add_lru() we will
> > call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
> > sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
> > race can still happen:
> >
> >
> > cpu1 cpu2
> > folio_evictable() //false
> > folio_test_clear_mlocked()
> > folio_test_clear_lru() //false
> > folio_set_lru()
> >
> > Relying on PG_lru for synchronization might not be enough with the
> > current code. We might need to revert 2262ace60713 ("mm/munlock:
> > delete smp_mb() from __pagevec_lru_add_fn()").
> >
> > Sorry for going back and forth here, I am thinking out loud.
>
> Yes. Currently, the order in lru_add_fn() is not correct.
>
> I think we should move folio_test_clear_lru(folio) into
>
> lru locked range. As the lru lock here was expected to
>
> use for sync here. Check the comment in lru_add_fn().

Right, I am wondering if it would be better to just revert
2262ace60713 and rely on the memory barrier and operations ordering
instead of the lru lock. The lru lock is heavily contended as-is, so
avoiding using it where possible is preferable I assume.

>
>
> Regards
>
> Yin, Fengwei
>
>
> >
> >> If yes, I can add this fix to my next version of the RFC series to
> >> rework mlock_count. It would be a lot more complicated with the
> >> current implementation (as I stated in a previous email).
> >>
> >>>>
> >>>> The page would be stranded on the unevictable list in this case, no?
> >>>> Maybe we should only try to isolate the page (clear PG_lru) after we
> >>>> possibly clear PG_mlocked? In this case if we fail to isolate we know
> >>>> for sure that whoever has the page isolated will observe that
> >>>> PG_mlocked is clear and correctly make the page evictable.
> >>>>
> >>>> This probably would be complicated with the current implementation, as
> >>>> we first need to decrement mlock_count to determine if we want to
> >>>> clear PG_mlocked, and to do so we need to isolate the page as
> >>>> mlock_count overlays page->lru. With the proposal in [1] to rework
> >>>> mlock_count, it might be much simpler as far as I can tell. I intend
> >>>> to refresh this proposal soon-ish.
> >>>>
> >>>> [1]https://lore.kernel.org/lkml/[email protected]/
> >>>>
> >>>>>
> >>>>> Regards
> >>>>> Yin, Fengwei
> >>>>>

2023-07-19 02:44:03

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/19/23 10:00, Yosry Ahmed wrote:
> On Tue, Jul 18, 2023 at 6:57 PM Yin Fengwei <[email protected]> wrote:
>>
>>
>> On 7/19/23 09:52, Yosry Ahmed wrote:
>>> On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
>>>> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 7/19/23 06:48, Yosry Ahmed wrote:
>>>>>> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 7/17/23 08:35, Yu Zhao wrote:
>>>>>>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>>>>>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>>>>>>>>> There is a problem here that I didn't have the time to elaborate: we
>>>>>>>>>> can't mlock() a folio that is within the range but not fully mapped
>>>>>>>>>> because this folio can be on the deferred split queue. When the split
>>>>>>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>>>>>>>>> into other vmas) will be stranded on the unevictable lru.
>>>>>>>>> This should be fine unless I missed something. During large folio split,
>>>>>>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>>>>>>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>>>>>>>> It's close but not entirely accurate: munlock can fail on isolated folios.
>>>>>>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
>>>>>>>
>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>
>>>>>> cpu1 cpu2
>>>>>> isolate folio
>>>>>> folio_test_clear_lru() // 0
>>>>>> putback folio // add
>>>>>> to unevictable list
>>>>>> folio_test_clear_mlocked()
>>>>> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
>>>>> could correct the none-mlocked but unevictable folio. But it doesn't because
>>>>> of this race.
>>>> (+Hugh Dickins for vis)
>>>>
>>>> Yu, I am not familiar with the split_folio() case, so I am not sure it
>>>> is the same exact race I stated above.
>>>>
>>>> Can you confirm whether or not doing folio_test_clear_mlocked() before
>>>> folio_test_clear_lru() would fix the race you are referring to? IIUC,
>>>> in this case, we make sure we clear PG_mlocked before we try to to
>>>> clear PG_lru. If we fail to clear it, then someone else have the folio
>>>> isolated after we clear PG_mlocked, so we can be sure that when they
>>>> put the folio back it will be correctly made evictable.
>>>>
>>>> Is my understanding correct?
>>> Hmm, actually this might not be enough. In folio_add_lru() we will
>>> call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
>>> sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
>>> race can still happen:
>>>
>>>
>>> cpu1 cpu2
>>> folio_evictable() //false
>>> folio_test_clear_mlocked()
>>> folio_test_clear_lru() //false
>>> folio_set_lru()
>>>
>>> Relying on PG_lru for synchronization might not be enough with the
>>> current code. We might need to revert 2262ace60713 ("mm/munlock:
>>> delete smp_mb() from __pagevec_lru_add_fn()").
>>>
>>> Sorry for going back and forth here, I am thinking out loud.
>>
>> Yes. Currently, the order in lru_add_fn() is not correct.
>>
>> I think we should move folio_test_clear_lru(folio) into
>>
>> lru locked range. As the lru lock here was expected to
>>
>> use for sync here. Check the comment in lru_add_fn().
>
> Right, I am wondering if it would be better to just revert
> 2262ace60713 and rely on the memory barrier and operations ordering
> instead of the lru lock. The lru lock is heavily contended as-is, so
> avoiding using it where possible is preferable I assume.
My understanding is that setting PG_lru after adding the folio to the lru list is
correct. Once folio_set_lru() is done, others can isolate this folio. But if
this folio is not on the lru list yet, what could happen? It's not required
to hold the lru lock to do the isolation.

>
>>
>>
>> Regards
>>
>> Yin, Fengwei
>>
>>
>>>
>>>> If yes, I can add this fix to my next version of the RFC series to
>>>> rework mlock_count. It would be a lot more complicated with the
>>>> current implementation (as I stated in a previous email).
>>>>
>>>>>>
>>>>>> The page would be stranded on the unevictable list in this case, no?
>>>>>> Maybe we should only try to isolate the page (clear PG_lru) after we
>>>>>> possibly clear PG_mlocked? In this case if we fail to isolate we know
>>>>>> for sure that whoever has the page isolated will observe that
>>>>>> PG_mlocked is clear and correctly make the page evictable.
>>>>>>
>>>>>> This probably would be complicated with the current implementation, as
>>>>>> we first need to decrement mlock_count to determine if we want to
>>>>>> clear PG_mlocked, and to do so we need to isolate the page as
>>>>>> mlock_count overlays page->lru. With the proposal in [1] to rework
>>>>>> mlock_count, it might be much simpler as far as I can tell. I intend
>>>>>> to refresh this proposal soon-ish.
>>>>>>
>>>>>> [1]https://lore.kernel.org/lkml/[email protected]/
>>>>>>
>>>>>>>
>>>>>>> Regards
>>>>>>> Yin, Fengwei
>>>>>>>

2023-07-19 02:50:51

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/19/23 10:22, Yosry Ahmed wrote:
> On Tue, Jul 18, 2023 at 7:10 PM Yin Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/19/23 10:00, Yosry Ahmed wrote:
>>> On Tue, Jul 18, 2023 at 6:57 PM Yin Fengwei <[email protected]> wrote:
>>>>
>>>>
>>>> On 7/19/23 09:52, Yosry Ahmed wrote:
>>>>> On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
>>>>>> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 7/19/23 06:48, Yosry Ahmed wrote:
>>>>>>>> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 7/17/23 08:35, Yu Zhao wrote:
>>>>>>>>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
>>>>>>>>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
>>>>>>>>>>>> There is a problem here that I didn't have the time to elaborate: we
>>>>>>>>>>>> can't mlock() a folio that is within the range but not fully mapped
>>>>>>>>>>>> because this folio can be on the deferred split queue. When the split
>>>>>>>>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>>>>>>>>>>> into other vmas) will be stranded on the unevictable lru.
>>>>>>>>>>> This should be fine unless I missed something. During large folio split,
>>>>>>>>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
>>>>>>>>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
>>>>>>>>>> It's close but not entirely accurate: munlock can fail on isolated folios.
>>>>>>>>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
>>>>>>>>>
>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>
>>>>>>>> cpu1 cpu2
>>>>>>>> isolate folio
>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>> putback folio // add
>>>>>>>> to unevictable list
>>>>>>>> folio_test_clear_mlocked()
>>>>>>> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
>>>>>>> could correct the none-mlocked but unevictable folio. But it doesn't because
>>>>>>> of this race.
>>>>>> (+Hugh Dickins for vis)
>>>>>>
>>>>>> Yu, I am not familiar with the split_folio() case, so I am not sure it
>>>>>> is the same exact race I stated above.
>>>>>>
>>>>>> Can you confirm whether or not doing folio_test_clear_mlocked() before
>>>>>> folio_test_clear_lru() would fix the race you are referring to? IIUC,
>>>>>> in this case, we make sure we clear PG_mlocked before we try to to
>>>>>> clear PG_lru. If we fail to clear it, then someone else have the folio
>>>>>> isolated after we clear PG_mlocked, so we can be sure that when they
>>>>>> put the folio back it will be correctly made evictable.
>>>>>>
>>>>>> Is my understanding correct?
>>>>> Hmm, actually this might not be enough. In folio_add_lru() we will
>>>>> call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
>>>>> sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
>>>>> race can still happen:
>>>>>
>>>>>
>>>>> cpu1 cpu2
>>>>> folio_evictable() //false
>>>>> folio_test_clear_mlocked()
>>>>> folio_test_clear_lru() //false
>>>>> folio_set_lru()
>>>>>
>>>>> Relying on PG_lru for synchronization might not be enough with the
>>>>> current code. We might need to revert 2262ace60713 ("mm/munlock:
>>>>> delete smp_mb() from __pagevec_lru_add_fn()").
>>>>>
>>>>> Sorry for going back and forth here, I am thinking out loud.
>>>>
>>>> Yes. Currently, the order in lru_add_fn() is not correct.
>>>>
>>>> I think we should move folio_test_clear_lru(folio) into
>>>>
>>>> lru locked range. As the lru lock here was expected to
>>>>
>>>> use for sync here. Check the comment in lru_add_fn().
>>>
>>> Right, I am wondering if it would be better to just revert
>>> 2262ace60713 and rely on the memory barrier and operations ordering
>>> instead of the lru lock. The lru lock is heavily contended as-is, so
>>> avoiding using it where possible is preferable I assume.
>> My understanding is set_lru after add folio to lru list is correct.
>> Once folio_set_lru(), others can do isolation of this folio. But if
>> this folio is not in lru list yet, what could happen? It's not required
>> to hold lru lock to do isolation.
>
> IIUC, clearing PG_lru is an atomic lockless operation to make sure no
> one else is isolating the folio, but then you need to hold the lruvec
> lock and actually delete the folio from the lru to complete its
> isolation. This is my read of folio_isolate_lru().
>
> Anyway, whether we rely on the lruvec lock or memory barrier +
> operation ordering doesn't make a huge difference (I think?). The code
> seemed to work with the latter before mlock_count was introduced.
>
> If we decide to go with the latter, I can integrate the fix into the
> refresh of my mlock_count rework RFC series (as it would be dependent
> on that series). If we decide to go with the lruvec, then it can be
> done as part of this series, or separately.
Let's wait for the response from Hugh and Yu. :).

>
> Thanks.
>
>>
>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>> Yin, Fengwei
>>>>
>>>>
>>>>>
>>>>>> If yes, I can add this fix to my next version of the RFC series to
>>>>>> rework mlock_count. It would be a lot more complicated with the
>>>>>> current implementation (as I stated in a previous email).
>>>>>>
>>>>>>>>
>>>>>>>> The page would be stranded on the unevictable list in this case, no?
>>>>>>>> Maybe we should only try to isolate the page (clear PG_lru) after we
>>>>>>>> possibly clear PG_mlocked? In this case if we fail to isolate we know
>>>>>>>> for sure that whoever has the page isolated will observe that
>>>>>>>> PG_mlocked is clear and correctly make the page evictable.
>>>>>>>>
>>>>>>>> This probably would be complicated with the current implementation, as
>>>>>>>> we first need to decrement mlock_count to determine if we want to
>>>>>>>> clear PG_mlocked, and to do so we need to isolate the page as
>>>>>>>> mlock_count overlays page->lru. With the proposal in [1] to rework
>>>>>>>> mlock_count, it might be much simpler as far as I can tell. I intend
>>>>>>>> to refresh this proposal soon-ish.
>>>>>>>>
>>>>>>>> [1]https://lore.kernel.org/lkml/[email protected]/
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Yin, Fengwei
>>>>>>>>>

2023-07-19 02:53:59

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Tue, Jul 18, 2023 at 7:10 PM Yin Fengwei <[email protected]> wrote:
>
>
>
> On 7/19/23 10:00, Yosry Ahmed wrote:
> > On Tue, Jul 18, 2023 at 6:57 PM Yin Fengwei <[email protected]> wrote:
> >>
> >>
> >> On 7/19/23 09:52, Yosry Ahmed wrote:
> >>> On Tue, Jul 18, 2023 at 6:32 PM Yosry Ahmed <[email protected]> wrote:
> >>>> On Tue, Jul 18, 2023 at 4:47 PM Yin Fengwei <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>> On 7/19/23 06:48, Yosry Ahmed wrote:
> >>>>>> On Sun, Jul 16, 2023 at 6:58 PM Yin Fengwei <[email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 7/17/23 08:35, Yu Zhao wrote:
> >>>>>>>> On Sun, Jul 16, 2023 at 6:00 PM Yin, Fengwei <[email protected]> wrote:
> >>>>>>>>> On 7/15/2023 2:06 PM, Yu Zhao wrote:
> >>>>>>>>>> There is a problem here that I didn't have the time to elaborate: we
> >>>>>>>>>> can't mlock() a folio that is within the range but not fully mapped
> >>>>>>>>>> because this folio can be on the deferred split queue. When the split
> >>>>>>>>>> happens, those unmapped folios (not mapped by this vma but are mapped
> >>>>>>>>>> into other vmas) will be stranded on the unevictable lru.
> >>>>>>>>> This should be fine unless I missed something. During large folio split,
> >>>>>>>>> the unmap_folio() will be migrate(anon)/unmap(file) folio. Folio will be
> >>>>>>>>> munlocked in unmap_folio(). So the head/tail pages will be evictable always.
> >>>>>>>> It's close but not entirely accurate: munlock can fail on isolated folios.
> >>>>>>> Yes. The munlock just clear PG_mlocked bit but with PG_unevictable left.
> >>>>>>>
> >>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>
> >>>>>> cpu1 cpu2
> >>>>>> isolate folio
> >>>>>> folio_test_clear_lru() // 0
> >>>>>> putback folio // add
> >>>>>> to unevictable list
> >>>>>> folio_test_clear_mlocked()
> >>>>> Yes. Yu showed this sequence to me in another email. I thought the putback_lru()
> >>>>> could correct the none-mlocked but unevictable folio. But it doesn't because
> >>>>> of this race.
> >>>> (+Hugh Dickins for vis)
> >>>>
> >>>> Yu, I am not familiar with the split_folio() case, so I am not sure it
> >>>> is the same exact race I stated above.
> >>>>
> >>>> Can you confirm whether or not doing folio_test_clear_mlocked() before
> >>>> folio_test_clear_lru() would fix the race you are referring to? IIUC,
> >>>> in this case, we make sure we clear PG_mlocked before we try to to
> >>>> clear PG_lru. If we fail to clear it, then someone else have the folio
> >>>> isolated after we clear PG_mlocked, so we can be sure that when they
> >>>> put the folio back it will be correctly made evictable.
> >>>>
> >>>> Is my understanding correct?
> >>> Hmm, actually this might not be enough. In folio_add_lru() we will
> >>> call folio_batch_add_and_move(), which calls lru_add_fn() and *then*
> >>> sets PG_lru. Since we check folio_evictable() in lru_add_fn(), the
> >>> race can still happen:
> >>>
> >>>
> >>> cpu1 cpu2
> >>> folio_evictable() //false
> >>> folio_test_clear_mlocked()
> >>> folio_test_clear_lru() //false
> >>> folio_set_lru()
> >>>
> >>> Relying on PG_lru for synchronization might not be enough with the
> >>> current code. We might need to revert 2262ace60713 ("mm/munlock:
> >>> delete smp_mb() from __pagevec_lru_add_fn()").
> >>>
> >>> Sorry for going back and forth here, I am thinking out loud.
> >>
> >> Yes. Currently, the order in lru_add_fn() is not correct.
> >>
> >> I think we should move folio_test_clear_lru(folio) into
> >>
> >> lru locked range. As the lru lock here was expected to
> >>
> >> use for sync here. Check the comment in lru_add_fn().
> >
> > Right, I am wondering if it would be better to just revert
> > 2262ace60713 and rely on the memory barrier and operations ordering
> > instead of the lru lock. The lru lock is heavily contended as-is, so
> > avoiding using it where possible is preferable I assume.
> My understanding is set_lru after add folio to lru list is correct.
> Once folio_set_lru(), others can do isolation of this folio. But if
> this folio is not in lru list yet, what could happen? It's not required
> to hold lru lock to do isolation.

IIUC, clearing PG_lru is an atomic lockless operation to make sure no
one else is isolating the folio, but then you need to hold the lruvec
lock and actually delete the folio from the lru to complete its
isolation. This is my read of folio_isolate_lru().

Anyway, whether we rely on the lruvec lock or memory barrier +
operation ordering doesn't make a huge difference (I think?). The code
seemed to work with the latter before mlock_count was introduced.

If we decide to go with the latter, I can integrate the fix into the
refresh of my mlock_count rework RFC series (as it would be dependent
on that series). If we decide to go with the lruvec, then it can be
done as part of this series, or separately.

Thanks.

>
> >
> >>
> >>
> >> Regards
> >>
> >> Yin, Fengwei
> >>
> >>
> >>>
> >>>> If yes, I can add this fix to my next version of the RFC series to
> >>>> rework mlock_count. It would be a lot more complicated with the
> >>>> current implementation (as I stated in a previous email).
> >>>>
> >>>>>>
> >>>>>> The page would be stranded on the unevictable list in this case, no?
> >>>>>> Maybe we should only try to isolate the page (clear PG_lru) after we
> >>>>>> possibly clear PG_mlocked? In this case if we fail to isolate we know
> >>>>>> for sure that whoever has the page isolated will observe that
> >>>>>> PG_mlocked is clear and correctly make the page evictable.
> >>>>>>
> >>>>>> This probably would be complicated with the current implementation, as
> >>>>>> we first need to decrement mlock_count to determine if we want to
> >>>>>> clear PG_mlocked, and to do so we need to isolate the page as
> >>>>>> mlock_count overlays page->lru. With the proposal in [1] to rework
> >>>>>> mlock_count, it might be much simpler as far as I can tell. I intend
> >>>>>> to refresh this proposal soon-ish.
> >>>>>>
> >>>>>> [1]https://lore.kernel.org/lkml/[email protected]/
> >>>>>>
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> Yin, Fengwei
> >>>>>>>

2023-07-19 14:33:56

by Hugh Dickins

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Wed, 19 Jul 2023, Yin Fengwei wrote:
> >>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>>>
> >>>>>>>> cpu1 cpu2
> >>>>>>>> isolate folio
> >>>>>>>> folio_test_clear_lru() // 0
> >>>>>>>> putback folio // add to unevictable list
> >>>>>>>> folio_test_clear_mlocked()
> >>>>> folio_set_lru()
> Let's wait the response from Huge and Yu. :).

I haven't been able to give it enough thought, but I suspect you are right:
that the current __munlock_folio() is deficient when folio_test_clear_lru()
fails.

(Though it has not been reported as a problem in practice: perhaps because
so few places try to isolate from the unevictable "list".)

I forget what my order of development was, but it's likely that I first
wrote the version for our own internal kernel - which used our original
lruvec locking, which did not depend on getting PG_lru first (having got
lru_lock, it checked memcg, then tried again if that had changed).

I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
but it turned out to work okay - elsewhere; but it looks as if I missed
its implication when adapting __munlock_page() for upstream.

If I were trying to fix this __munlock_folio() race myself (sorry, I'm
not), I would first look at that aspect: instead of folio_test_clear_lru()
behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
spin waiting for PG_lru here?
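
A minimal sketch of what such a helper could look like (the name follows the
suggestion above and is purely hypothetical -- no such function exists upstream):

/*
 * Hypothetical helper, illustrative only: spin until PG_lru can be
 * cleared, instead of treating folio_test_clear_lru() as a trylock
 * that simply fails.
 */
static inline void folio_wait_clear_lru(struct folio *folio)
{
        while (!folio_test_clear_lru(folio))
                cpu_relax();
}

Note that Fengwei's reply further down raises a deadlock concern with exactly this
kind of spinning when the caller already holds a pte lock.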

Hugh

2023-07-19 16:28:41

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
>
> On Wed, 19 Jul 2023, Yin Fengwei wrote:
> > >>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> > >>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> > >>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> > >>>>>>>> cpu 2 is isolating the folio for any purpose:
> > >>>>>>>>
> > >>>>>>>> cpu1 cpu2
> > >>>>>>>> isolate folio
> > >>>>>>>> folio_test_clear_lru() // 0
> > >>>>>>>> putback folio // add to unevictable list
> > >>>>>>>> folio_test_clear_mlocked()
> > >>>>> folio_set_lru()
> > Let's wait the response from Huge and Yu. :).
>
> I haven't been able to give it enough thought, but I suspect you are right:
> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> fails.
>
> (Though it has not been reported as a problem in practice: perhaps because
> so few places try to isolate from the unevictable "list".)
>
> I forget what my order of development was, but it's likely that I first
> wrote the version for our own internal kernel - which used our original
> lruvec locking, which did not depend on getting PG_lru first (having got
> lru_lock, it checked memcg, then tried again if that had changed).

Right. Just holding the lruvec lock without clearing PG_lru would not
protect against memcg movement in this case.

>
> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> but it turned out to work okay - elsewhere; but it looks as if I missed
> its implication when adapting __munlock_page() for upstream.
>
> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> not), I would first look at that aspect: instead of folio_test_clear_lru()
> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> spin waiting for PG_lru here?

+Matthew Wilcox

It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
folio_set_lru() before checking folio_evictable(). While this is
probably extraneous since folio_batch_move_lru() will set it again
afterwards, it's probably harmless given that the lruvec lock is held
throughout (so no one can complete the folio isolation anyway), and
given that there were no problems introduced by this extra
folio_set_lru() as far as I can tell.

If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
("mm/munlock:
delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
ordering between manipulating PG_lru and PG_mlocked, I suppose we can
get away without having to spin. Again, that would only be possible if
reworking mlock_count [1] is acceptable. Otherwise, we can't clear
PG_mlocked before PG_lru in __munlock_folio().

I am not saying this is necessarily better than spinning, just a note
(and perhaps selfishly making [1] more appealing ;)).

[1]https://lore.kernel.org/lkml/[email protected]/

>
> Hugh

2023-07-20 02:30:39

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/19/2023 10:26 PM, Hugh Dickins wrote:
> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>
>>>>>>>>>> cpu1 cpu2
>>>>>>>>>> isolate folio
>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>> putback folio // add to unevictable list
>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>> folio_set_lru()
>> Let's wait the response from Huge and Yu. :).
>
> I haven't been able to give it enough thought, but I suspect you are right:
> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> fails.
>
> (Though it has not been reported as a problem in practice: perhaps because
> so few places try to isolate from the unevictable "list".)
>
> I forget what my order of development was, but it's likely that I first
> wrote the version for our own internal kernel - which used our original
> lruvec locking, which did not depend on getting PG_lru first (having got
> lru_lock, it checked memcg, then tried again if that had changed).
>
> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> but it turned out to work okay - elsewhere; but it looks as if I missed
> its implication when adapting __munlock_page() for upstream.
>
> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> not), I would first look at that aspect: instead of folio_test_clear_lru()
> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> spin waiting for PG_lru here?
Consider the following sequence:

CPU1 (migration)                   CPU2 (mlock)

isolate page (clear lru)           mlock_pte_range
try_to_migrate                       -> take pte lock
  try_to_migrate_one               munlock_folio
    pvmw -> take pte lock          __munlock_folio if batch full
                                   folio_wait_clear_lru

deadlock may happen.


Regards
Yin, Fengwei

>
> Hugh

2023-07-20 12:46:18

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
>>
>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>>
>>>>>>>>>>> cpu1 cpu2
>>>>>>>>>>> isolate folio
>>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>>> putback folio // add to unevictable list
>>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>>> folio_set_lru()
>>> Let's wait the response from Huge and Yu. :).
>>
>> I haven't been able to give it enough thought, but I suspect you are right:
>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
>> fails.
>>
>> (Though it has not been reported as a problem in practice: perhaps because
>> so few places try to isolate from the unevictable "list".)
>>
>> I forget what my order of development was, but it's likely that I first
>> wrote the version for our own internal kernel - which used our original
>> lruvec locking, which did not depend on getting PG_lru first (having got
>> lru_lock, it checked memcg, then tried again if that had changed).
>
> Right. Just holding the lruvec lock without clearing PG_lru would not
> protect against memcg movement in this case.
>
>>
>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
>> but it turned out to work okay - elsewhere; but it looks as if I missed
>> its implication when adapting __munlock_page() for upstream.
>>
>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
>> not), I would first look at that aspect: instead of folio_test_clear_lru()
>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
>> spin waiting for PG_lru here?
>
> +Matthew Wilcox
>
> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> folio_set_lru() before checking folio_evictable(). While this is
> probably extraneous since folio_batch_move_lru() will set it again
> afterwards, it's probably harmless given that the lruvec lock is held
> throughout (so no one can complete the folio isolation anyway), and
> given that there were no problems introduced by this extra
> folio_set_lru() as far as I can tell.
After checking the related code: yes, it looks fine to move folio_set_lru()
before the if (folio_evictable(folio)) check in lru_add_fn(), because the
lru lock is held there.

>
> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> ("mm/munlock:
> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
> get away without having to spin. Again, that would only be possible if
> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
> PG_mlocked before PG_lru in __munlock_folio().
What about the following change, which moves the PG_mlocked handling before
the lru check in __munlock_folio()?

diff --git a/mm/mlock.c b/mm/mlock.c
index 0a0c996c5c21..514f0d5bfbfd 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
{
int nr_pages = folio_nr_pages(folio);
- bool isolated = false;
+ bool isolated = false, mlocked = true;
+
+ mlocked = folio_test_clear_mlocked(folio);

if (!folio_test_clear_lru(folio))
goto munlock;
@@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
/* Then mlock_count is maintained, but might undercount */
if (folio->mlock_count)
folio->mlock_count--;
- if (folio->mlock_count)
+ if (folio->mlock_count) {
+ if (mlocked)
+ folio_set_mlocked(folio);
goto out;
+ }
}
/* else assume that was the last mlock: reclaim will fix it if not */

munlock:
- if (folio_test_clear_mlocked(folio)) {
+ if (mlocked) {
__zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
if (isolated || !folio_test_unevictable(folio))
__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);


>
> I am not saying this is necessarily better than spinning, just a note
> (and perhaps selfishly making [1] more appealing ;)).
>
> [1]https://lore.kernel.org/lkml/[email protected]/
>
>>
>> Hugh

2023-07-20 21:52:23

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <[email protected]> wrote:
>
>
>
> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> > On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
> >>
> >> On Wed, 19 Jul 2023, Yin Fengwei wrote:
> >>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>>>>>>
> >>>>>>>>>>> cpu1 cpu2
> >>>>>>>>>>> isolate folio
> >>>>>>>>>>> folio_test_clear_lru() // 0
> >>>>>>>>>>> putback folio // add to unevictable list
> >>>>>>>>>>> folio_test_clear_mlocked()
> >>>>>>>> folio_set_lru()
> >>> Let's wait the response from Huge and Yu. :).
> >>
> >> I haven't been able to give it enough thought, but I suspect you are right:
> >> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> >> fails.
> >>
> >> (Though it has not been reported as a problem in practice: perhaps because
> >> so few places try to isolate from the unevictable "list".)
> >>
> >> I forget what my order of development was, but it's likely that I first
> >> wrote the version for our own internal kernel - which used our original
> >> lruvec locking, which did not depend on getting PG_lru first (having got
> >> lru_lock, it checked memcg, then tried again if that had changed).
> >
> > Right. Just holding the lruvec lock without clearing PG_lru would not
> > protect against memcg movement in this case.
> >
> >>
> >> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> >> but it turned out to work okay - elsewhere; but it looks as if I missed
> >> its implication when adapting __munlock_page() for upstream.
> >>
> >> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> >> not), I would first look at that aspect: instead of folio_test_clear_lru()
> >> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> >> spin waiting for PG_lru here?
> >
> > +Matthew Wilcox
> >
> > It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> > a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> > folio_set_lru() before checking folio_evictable(). While this is
> > probably extraneous since folio_batch_move_lru() will set it again
> > afterwards, it's probably harmless given that the lruvec lock is held
> > throughout (so no one can complete the folio isolation anyway), and
> > given that there were no problems introduced by this extra
> > folio_set_lru() as far as I can tell.
> After checking related code, Yes. Looks fine if we move folio_set_lru()
> before if (folio_evictable(folio)) in lru_add_fn() because of holding
> lru lock.
>
> >
> > If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> > ("mm/munlock:
> > delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
> > ordering between manipulating PG_lru and PG_mlocked, I suppose we can
> > get away without having to spin. Again, that would only be possible if
> > reworking mlock_count [1] is acceptable. Otherwise, we can't clear
> > PG_mlocked before PG_lru in __munlock_folio().
> What about following change to move mlocked operation before check lru
> in __munlock_folio()?

It seems correct to me on a high level, but I think there is a subtle problem:

We clear PG_mlocked before trying to isolate, to make sure that if
someone already has the folio isolated they will put it back on an
evictable list; then, if we are able to isolate the folio ourselves and
find that the mlock_count is > 0, we set PG_mlocked again.

There is a small window where PG_mlocked is temporarily cleared but the
folio is not actually munlocked (i.e. we don't update the NR_MLOCK stat).
In that window, a racing reclaimer on a different cpu may find VM_LOCKED
in a different vma and call mlock_folio(). In mlock_folio(), we call
folio_test_set_mlocked(folio), see that PG_mlocked is clear, and so
increment the MLOCK stats even though the folio was already mlocked. This
can cause the MLOCK stats to become unbalanced (more increments than
decrements), no?
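
For reference, the counting in mlock_folio() is shaped roughly like this
(paraphrased, not an exact quote of mm/mlock.c):

	/* Only account when the flag actually flips from clear to set. */
	if (!folio_test_set_mlocked(folio)) {
		int nr_pages = folio_nr_pages(folio);

		zone_stat_mod_folio(folio, NR_MLOCK, nr_pages);
		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
	}

so a second mlocker only skips the accounting if it still sees PG_mlocked
set.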

>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0a0c996c5c21..514f0d5bfbfd 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
> static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
> {
> int nr_pages = folio_nr_pages(folio);
> - bool isolated = false;
> + bool isolated = false, mlocked = true;
> +
> + mlocked = folio_test_clear_mlocked(folio);
>
> if (!folio_test_clear_lru(folio))
> goto munlock;
> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
> /* Then mlock_count is maintained, but might undercount */
> if (folio->mlock_count)
> folio->mlock_count--;
> - if (folio->mlock_count)
> + if (folio->mlock_count) {
> + if (mlocked)
> + folio_set_mlocked(folio);
> goto out;
> + }
> }
> /* else assume that was the last mlock: reclaim will fix it if not */
>
> munlock:
> - if (folio_test_clear_mlocked(folio)) {
> + if (mlocked) {
> __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
> if (isolated || !folio_test_unevictable(folio))
> __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
>
>
> >
> > I am not saying this is necessarily better than spinning, just a note
> > (and perhaps selfishly making [1] more appealing ;)).
> >
> > [1]https://lore.kernel.org/lkml/[email protected]/
> >
> >>
> >> Hugh

2023-07-21 01:52:56

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
>>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
>>>>
>>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>>>>
>>>>>>>>>>>>> cpu1 cpu2
>>>>>>>>>>>>> isolate folio
>>>>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>>>>> putback folio // add to unevictable list
>>>>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>>>>> folio_set_lru()
>>>>> Let's wait the response from Huge and Yu. :).
>>>>
>>>> I haven't been able to give it enough thought, but I suspect you are right:
>>>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
>>>> fails.
>>>>
>>>> (Though it has not been reported as a problem in practice: perhaps because
>>>> so few places try to isolate from the unevictable "list".)
>>>>
>>>> I forget what my order of development was, but it's likely that I first
>>>> wrote the version for our own internal kernel - which used our original
>>>> lruvec locking, which did not depend on getting PG_lru first (having got
>>>> lru_lock, it checked memcg, then tried again if that had changed).
>>>
>>> Right. Just holding the lruvec lock without clearing PG_lru would not
>>> protect against memcg movement in this case.
>>>
>>>>
>>>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
>>>> but it turned out to work okay - elsewhere; but it looks as if I missed
>>>> its implication when adapting __munlock_page() for upstream.
>>>>
>>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
>>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
>>>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
>>>> spin waiting for PG_lru here?
>>>
>>> +Matthew Wilcox
>>>
>>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
>>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
>>> folio_set_lru() before checking folio_evictable(). While this is
>>> probably extraneous since folio_batch_move_lru() will set it again
>>> afterwards, it's probably harmless given that the lruvec lock is held
>>> throughout (so no one can complete the folio isolation anyway), and
>>> given that there were no problems introduced by this extra
>>> folio_set_lru() as far as I can tell.
>> After checking related code, Yes. Looks fine if we move folio_set_lru()
>> before if (folio_evictable(folio)) in lru_add_fn() because of holding
>> lru lock.
>>
>>>
>>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
>>> ("mm/munlock:
>>> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
>>> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
>>> get away without having to spin. Again, that would only be possible if
>>> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
>>> PG_mlocked before PG_lru in __munlock_folio().
>> What about following change to move mlocked operation before check lru
>> in __munlock_folio()?
>
> It seems correct to me on a high level, but I think there is a subtle problem:
>
> We clear PG_mlocked before trying to isolate to make sure that if
> someone already has the folio isolated they will put it back on an
> evictable list, then if we are able to isolate the folio ourselves and
> find that the mlock_count is > 0, we set PG_mlocked again.
>
> There is a small window where PG_mlocked might be temporarily cleared
> but the folio is not actually munlocked (i.e we don't update the
> NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
> may find VM_LOCKED from in a different vma, and call mlock_folio(). In
> mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
> PG_mlocked is clear, so we will increment the MLOCK stats, even though
> the folio was already mlocked. This can cause MLOCK stats to be
> unbalanced (increments more than decrements), no?
It looks like NR_MLOCK is always updated together with the PG_mlocked bit,
so it should not be possible for it to become unbalanced.

Let's say:
mlock_folio() NR_MLOCK increase and set mlocked
mlock_folio() NR_MLOCK NO change as folio is already mlocked

__munlock_folio() with isolated folio. NR_MLOCK decrease (0) and
clear mlocked

folio_putback_lru()
reclaimer: mlock_folio() NR_MLOCK increase and set mlocked

munlock_folio() NR_MLOCK decrease (0) and clear mlocked
munlock_folio() NR_MLOCK NO change as folio has no mlocked set


Regards
Yin, Fengwei

>
>>
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index 0a0c996c5c21..514f0d5bfbfd 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
>> static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
>> {
>> int nr_pages = folio_nr_pages(folio);
>> - bool isolated = false;
>> + bool isolated = false, mlocked = true;
>> +
>> + mlocked = folio_test_clear_mlocked(folio);
>>
>> if (!folio_test_clear_lru(folio))
>> goto munlock;
>> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
>> /* Then mlock_count is maintained, but might undercount */
>> if (folio->mlock_count)
>> folio->mlock_count--;
>> - if (folio->mlock_count)
>> + if (folio->mlock_count) {
>> + if (mlocked)
>> + folio_set_mlocked(folio);
>> goto out;
>> + }
>> }
>> /* else assume that was the last mlock: reclaim will fix it if not */
>>
>> munlock:
>> - if (folio_test_clear_mlocked(folio)) {
>> + if (mlocked) {
>> __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
>> if (isolated || !folio_test_unevictable(folio))
>> __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
>>
>>
>>>
>>> I am not saying this is necessarily better than spinning, just a note
>>> (and perhaps selfishly making [1] more appealing ;)).
>>>
>>> [1]https://lore.kernel.org/lkml/[email protected]/
>>>
>>>>
>>>> Hugh

2023-07-21 01:54:33

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Thu, Jul 20, 2023 at 6:12 PM Yin, Fengwei <[email protected]> wrote:
>
>
>
> On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
> > On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <[email protected]> wrote:
> >>
> >>
> >>
> >> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> >>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
> >>>>
> >>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
> >>>>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> cpu1 cpu2
> >>>>>>>>>>>>> isolate folio
> >>>>>>>>>>>>> folio_test_clear_lru() // 0
> >>>>>>>>>>>>> putback folio // add to unevictable list
> >>>>>>>>>>>>> folio_test_clear_mlocked()
> >>>>>>>>>> folio_set_lru()
> >>>>> Let's wait the response from Huge and Yu. :).
> >>>>
> >>>> I haven't been able to give it enough thought, but I suspect you are right:
> >>>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> >>>> fails.
> >>>>
> >>>> (Though it has not been reported as a problem in practice: perhaps because
> >>>> so few places try to isolate from the unevictable "list".)
> >>>>
> >>>> I forget what my order of development was, but it's likely that I first
> >>>> wrote the version for our own internal kernel - which used our original
> >>>> lruvec locking, which did not depend on getting PG_lru first (having got
> >>>> lru_lock, it checked memcg, then tried again if that had changed).
> >>>
> >>> Right. Just holding the lruvec lock without clearing PG_lru would not
> >>> protect against memcg movement in this case.
> >>>
> >>>>
> >>>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> >>>> but it turned out to work okay - elsewhere; but it looks as if I missed
> >>>> its implication when adapting __munlock_page() for upstream.
> >>>>
> >>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> >>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
> >>>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> >>>> spin waiting for PG_lru here?
> >>>
> >>> +Matthew Wilcox
> >>>
> >>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> >>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> >>> folio_set_lru() before checking folio_evictable(). While this is
> >>> probably extraneous since folio_batch_move_lru() will set it again
> >>> afterwards, it's probably harmless given that the lruvec lock is held
> >>> throughout (so no one can complete the folio isolation anyway), and
> >>> given that there were no problems introduced by this extra
> >>> folio_set_lru() as far as I can tell.
> >> After checking related code, Yes. Looks fine if we move folio_set_lru()
> >> before if (folio_evictable(folio)) in lru_add_fn() because of holding
> >> lru lock.
> >>
> >>>
> >>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> >>> ("mm/munlock:
> >>> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
> >>> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
> >>> get away without having to spin. Again, that would only be possible if
> >>> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
> >>> PG_mlocked before PG_lru in __munlock_folio().
> >> What about following change to move mlocked operation before check lru
> >> in __munlock_folio()?
> >
> > It seems correct to me on a high level, but I think there is a subtle problem:
> >
> > We clear PG_mlocked before trying to isolate to make sure that if
> > someone already has the folio isolated they will put it back on an
> > evictable list, then if we are able to isolate the folio ourselves and
> > find that the mlock_count is > 0, we set PG_mlocked again.
> >
> > There is a small window where PG_mlocked might be temporarily cleared
> > but the folio is not actually munlocked (i.e we don't update the
> > NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
> > may find VM_LOCKED from in a different vma, and call mlock_folio(). In
> > mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
> > PG_mlocked is clear, so we will increment the MLOCK stats, even though
> > the folio was already mlocked. This can cause MLOCK stats to be
> > unbalanced (increments more than decrements), no?
> Looks like NR_MLOCK is always connected to PG_mlocked bit. Not possible
> to be unbalanced.
>
> Let's say:
> mlock_folio() NR_MLOCK increase and set mlocked
> mlock_folio() NR_MLOCK NO change as folio is already mlocked
>
> __munlock_folio() with isolated folio. NR_MLOCK decrease (0) and
> clear mlocked
>
> folio_putback_lru()
> reclaimed mlock_folio() NR_MLOCK increase and set mlocked
>
> munlock_folio() NR_MLOCK decrease (0) and clear mlocked
> munlock_folio() NR_MLOCK NO change as folio has no mlocked set

Right. The problem with the diff is that we temporarily clear
PG_mlocked *without* updating NR_MLOCK.

Consider a folio that is mlocked by two vmas. NR_MLOCK = folio_nr_pages.

Assume cpu 1 is doing __munlock_folio from one of the vmas, while cpu
2 is doing reclaim.

cpu 1                           cpu 2
clear PG_mlocked
                                folio_referenced()
                                  mlock_folio()
                                    set PG_mlocked
                                    add to NR_MLOCK
mlock_count > 0
  set PG_mlocked
  goto out

Result: NR_MLOCK = folio_nr_pages * 2.

When the folio is munlock()'d later from the second vma, NR_MLOCK will
be reduced to folio_nr_pages, but there are no mlocked folios left.

This is the scenario that I have in mind. Please correct me if I am wrong.

>
>
> Regards
> Yin, Fengwei
>
> >
> >>
> >> diff --git a/mm/mlock.c b/mm/mlock.c
> >> index 0a0c996c5c21..514f0d5bfbfd 100644
> >> --- a/mm/mlock.c
> >> +++ b/mm/mlock.c
> >> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
> >> static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
> >> {
> >> int nr_pages = folio_nr_pages(folio);
> >> - bool isolated = false;
> >> + bool isolated = false, mlocked = true;
> >> +
> >> + mlocked = folio_test_clear_mlocked(folio);
> >>
> >> if (!folio_test_clear_lru(folio))
> >> goto munlock;
> >> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
> >> /* Then mlock_count is maintained, but might undercount */
> >> if (folio->mlock_count)
> >> folio->mlock_count--;
> >> - if (folio->mlock_count)
> >> + if (folio->mlock_count) {
> >> + if (mlocked)
> >> + folio_set_mlocked(folio);
> >> goto out;
> >> + }
> >> }
> >> /* else assume that was the last mlock: reclaim will fix it if not */
> >>
> >> munlock:
> >> - if (folio_test_clear_mlocked(folio)) {
> >> + if (mlocked) {
> >> __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
> >> if (isolated || !folio_test_unevictable(folio))
> >> __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
> >>
> >>
> >>>
> >>> I am not saying this is necessarily better than spinning, just a note
> >>> (and perhaps selfishly making [1] more appealing ;)).
> >>>
> >>> [1]https://lore.kernel.org/lkml/[email protected]/
> >>>
> >>>>
> >>>> Hugh

2023-07-21 03:46:25

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Thu, Jul 20, 2023 at 8:19 PM Yin, Fengwei <[email protected]> wrote:
>
>
>
> On 7/21/2023 9:35 AM, Yosry Ahmed wrote:
> > On Thu, Jul 20, 2023 at 6:12 PM Yin, Fengwei <[email protected]> wrote:
> >>
> >>
> >>
> >> On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
> >>> On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
> >>>>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
> >>>>>>
> >>>>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
> >>>>>>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
> >>>>>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
> >>>>>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
> >>>>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> cpu1 cpu2
> >>>>>>>>>>>>>>> isolate folio
> >>>>>>>>>>>>>>> folio_test_clear_lru() // 0
> >>>>>>>>>>>>>>> putback folio // add to unevictable list
> >>>>>>>>>>>>>>> folio_test_clear_mlocked()
> >>>>>>>>>>>> folio_set_lru()
> >>>>>>> Let's wait the response from Huge and Yu. :).
> >>>>>>
> >>>>>> I haven't been able to give it enough thought, but I suspect you are right:
> >>>>>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> >>>>>> fails.
> >>>>>>
> >>>>>> (Though it has not been reported as a problem in practice: perhaps because
> >>>>>> so few places try to isolate from the unevictable "list".)
> >>>>>>
> >>>>>> I forget what my order of development was, but it's likely that I first
> >>>>>> wrote the version for our own internal kernel - which used our original
> >>>>>> lruvec locking, which did not depend on getting PG_lru first (having got
> >>>>>> lru_lock, it checked memcg, then tried again if that had changed).
> >>>>>
> >>>>> Right. Just holding the lruvec lock without clearing PG_lru would not
> >>>>> protect against memcg movement in this case.
> >>>>>
> >>>>>>
> >>>>>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> >>>>>> but it turned out to work okay - elsewhere; but it looks as if I missed
> >>>>>> its implication when adapting __munlock_page() for upstream.
> >>>>>>
> >>>>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> >>>>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
> >>>>>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> >>>>>> spin waiting for PG_lru here?
> >>>>>
> >>>>> +Matthew Wilcox
> >>>>>
> >>>>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
> >>>>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
> >>>>> folio_set_lru() before checking folio_evictable(). While this is
> >>>>> probably extraneous since folio_batch_move_lru() will set it again
> >>>>> afterwards, it's probably harmless given that the lruvec lock is held
> >>>>> throughout (so no one can complete the folio isolation anyway), and
> >>>>> given that there were no problems introduced by this extra
> >>>>> folio_set_lru() as far as I can tell.
> >>>> After checking related code, Yes. Looks fine if we move folio_set_lru()
> >>>> before if (folio_evictable(folio)) in lru_add_fn() because of holding
> >>>> lru lock.
> >>>>
> >>>>>
> >>>>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
> >>>>> ("mm/munlock:
> >>>>> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
> >>>>> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
> >>>>> get away without having to spin. Again, that would only be possible if
> >>>>> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
> >>>>> PG_mlocked before PG_lru in __munlock_folio().
> >>>> What about following change to move mlocked operation before check lru
> >>>> in __munlock_folio()?
> >>>
> >>> It seems correct to me on a high level, but I think there is a subtle problem:
> >>>
> >>> We clear PG_mlocked before trying to isolate to make sure that if
> >>> someone already has the folio isolated they will put it back on an
> >>> evictable list, then if we are able to isolate the folio ourselves and
> >>> find that the mlock_count is > 0, we set PG_mlocked again.
> >>>
> >>> There is a small window where PG_mlocked might be temporarily cleared
> >>> but the folio is not actually munlocked (i.e we don't update the
> >>> NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
> >>> may find VM_LOCKED from in a different vma, and call mlock_folio(). In
> >>> mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
> >>> PG_mlocked is clear, so we will increment the MLOCK stats, even though
> >>> the folio was already mlocked. This can cause MLOCK stats to be
> >>> unbalanced (increments more than decrements), no?
> >> Looks like NR_MLOCK is always connected to PG_mlocked bit. Not possible
> >> to be unbalanced.
> >>
> >> Let's say:
> >> mlock_folio() NR_MLOCK increase and set mlocked
> >> mlock_folio() NR_MLOCK NO change as folio is already mlocked
> >>
> >> __munlock_folio() with isolated folio. NR_MLOCK decrease (0) and
> >> clear mlocked
> >>
> >> folio_putback_lru()
> >> reclaimed mlock_folio() NR_MLOCK increase and set mlocked
> >>
> >> munlock_folio() NR_MLOCK decrease (0) and clear mlocked
> >> munlock_folio() NR_MLOCK NO change as folio has no mlocked set
> >
> > Right. The problem with the diff is that we temporarily clear
> > PG_mlocked *without* updating NR_MLOCK.
> >
> > Consider a folio that is mlocked by two vmas. NR_MLOCK = folio_nr_pages.
> >
> > Assume cpu 1 is doing __munlock_folio from one of the vmas, while cpu
> > 2 is doing reclaim.
> >
> > cpu 1                           cpu 2
> > clear PG_mlocked
> >                                 folio_referenced()
> >                                   mlock_folio()
> >                                     set PG_mlocked
> >                                     add to NR_MLOCK
> > mlock_count > 0
> >   set PG_mlocked
> >   goto out
> >
> > Result: NR_MLOCK = folio_nr_pages * 2.
> >
> > When the folio is munlock()'d later from the second vma, NR_MLOCK will
> > be reduced to folio_nr_pages, but there are not mlocked folios.
> >
> > This is the scenario that I have in mind. Please correct me if I am wrong.
> Yes. Looks possible even may be difficult to hit.
>
> My first thought was it's not possible because unevictable folio will not
> be picked by reclaimer. But it's possible case if things happen between
> clear_mlock and test_and_clear_lru:
> folio_putback_lru() by other isolation user like migration
> reclaimer pick the folio and call mlock_folio()
> reclaimer call folio
>
> The fixing can be following the rules (combine NR_LOCK with PG_mlocked bit)
> strictly.

Yeah probably. I believe restoring the old ordering of manipulating
PG_lru and PG_mlocked with the memory barrier would be a simpler fix,
but this is only possible if the mlock_count rework gets merged.
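
The pairing I have in mind is the usual store/load one; as an illustration
(assuming the restored smp_mb() on the add side, and the full barrier
implied by the value-returning test-and-clear on the munlock side):

	lru add side                         munlock side
	folio_set_lru(folio)                 folio_test_clear_mlocked(folio)
	smp_mb()                             (barrier implied by the RMW)
	folio_evictable() reads PG_mlocked   folio_test_clear_lru() reads PG_lru

With a full barrier on each side, at least one of them must observe the
other's store: either the adder sees PG_mlocked already cleared and keeps
the folio evictable, or the munlocker sees PG_lru set, isolates the folio
and fixes things up.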

>
>
> Regards
> Yin, Fengwei
>
> >
> >>
> >>
> >> Regards
> >> Yin, Fengwei
> >>
> >>>
> >>>>
> >>>> diff --git a/mm/mlock.c b/mm/mlock.c
> >>>> index 0a0c996c5c21..514f0d5bfbfd 100644
> >>>> --- a/mm/mlock.c
> >>>> +++ b/mm/mlock.c
> >>>> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
> >>>> static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
> >>>> {
> >>>> int nr_pages = folio_nr_pages(folio);
> >>>> - bool isolated = false;
> >>>> + bool isolated = false, mlocked = true;
> >>>> +
> >>>> + mlocked = folio_test_clear_mlocked(folio);
> >>>>
> >>>> if (!folio_test_clear_lru(folio))
> >>>> goto munlock;
> >>>> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
> >>>> /* Then mlock_count is maintained, but might undercount */
> >>>> if (folio->mlock_count)
> >>>> folio->mlock_count--;
> >>>> - if (folio->mlock_count)
> >>>> + if (folio->mlock_count) {
> >>>> + if (mlocked)
> >>>> + folio_set_mlocked(folio);
> >>>> goto out;
> >>>> + }
> >>>> }
> >>>> /* else assume that was the last mlock: reclaim will fix it if not */
> >>>>
> >>>> munlock:
> >>>> - if (folio_test_clear_mlocked(folio)) {
> >>>> + if (mlocked) {
> >>>> __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
> >>>> if (isolated || !folio_test_unevictable(folio))
> >>>> __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
> >>>>
> >>>>
> >>>>>
> >>>>> I am not saying this is necessarily better than spinning, just a note
> >>>>> (and perhaps selfishly making [1] more appealing ;)).
> >>>>>
> >>>>> [1]https://lore.kernel.org/lkml/[email protected]/
> >>>>>
> >>>>>>
> >>>>>> Hugh

2023-07-21 03:50:57

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/21/2023 9:35 AM, Yosry Ahmed wrote:
> On Thu, Jul 20, 2023 at 6:12 PM Yin, Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/21/2023 4:51 AM, Yosry Ahmed wrote:
>>> On Thu, Jul 20, 2023 at 5:03 AM Yin, Fengwei <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 7/19/2023 11:44 PM, Yosry Ahmed wrote:
>>>>> On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <[email protected]> wrote:
>>>>>>
>>>>>> On Wed, 19 Jul 2023, Yin Fengwei wrote:
>>>>>>>>>>>>>>>> Could this also happen against normal 4K page? I mean when user try to munlock
>>>>>>>>>>>>>>>> a normal 4K page and this 4K page is isolated. So it become unevictable page?
>>>>>>>>>>>>>>> Looks like it can be possible. If cpu 1 is in __munlock_folio() and
>>>>>>>>>>>>>>> cpu 2 is isolating the folio for any purpose:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> cpu1 cpu2
>>>>>>>>>>>>>>> isolate folio
>>>>>>>>>>>>>>> folio_test_clear_lru() // 0
>>>>>>>>>>>>>>> putback folio // add to unevictable list
>>>>>>>>>>>>>>> folio_test_clear_mlocked()
>>>>>>>>>>>> folio_set_lru()
>>>>>>> Let's wait the response from Huge and Yu. :).
>>>>>>
>>>>>> I haven't been able to give it enough thought, but I suspect you are right:
>>>>>> that the current __munlock_folio() is deficient when folio_test_clear_lru()
>>>>>> fails.
>>>>>>
>>>>>> (Though it has not been reported as a problem in practice: perhaps because
>>>>>> so few places try to isolate from the unevictable "list".)
>>>>>>
>>>>>> I forget what my order of development was, but it's likely that I first
>>>>>> wrote the version for our own internal kernel - which used our original
>>>>>> lruvec locking, which did not depend on getting PG_lru first (having got
>>>>>> lru_lock, it checked memcg, then tried again if that had changed).
>>>>>
>>>>> Right. Just holding the lruvec lock without clearing PG_lru would not
>>>>> protect against memcg movement in this case.
>>>>>
>>>>>>
>>>>>> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
>>>>>> but it turned out to work okay - elsewhere; but it looks as if I missed
>>>>>> its implication when adapting __munlock_page() for upstream.
>>>>>>
>>>>>> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
>>>>>> not), I would first look at that aspect: instead of folio_test_clear_lru()
>>>>>> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
>>>>>> spin waiting for PG_lru here?
>>>>>
>>>>> +Matthew Wilcox
>>>>>
>>>>> It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
>>>>> a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
>>>>> folio_set_lru() before checking folio_evictable(). While this is
>>>>> probably extraneous since folio_batch_move_lru() will set it again
>>>>> afterwards, it's probably harmless given that the lruvec lock is held
>>>>> throughout (so no one can complete the folio isolation anyway), and
>>>>> given that there were no problems introduced by this extra
>>>>> folio_set_lru() as far as I can tell.
>>>> After checking related code, Yes. Looks fine if we move folio_set_lru()
>>>> before if (folio_evictable(folio)) in lru_add_fn() because of holding
>>>> lru lock.
>>>>
>>>>>
>>>>> If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
>>>>> ("mm/munlock:
>>>>> delete smp_mb() from __pagevec_lru_add_fn()") to restore the strict
>>>>> ordering between manipulating PG_lru and PG_mlocked, I suppose we can
>>>>> get away without having to spin. Again, that would only be possible if
>>>>> reworking mlock_count [1] is acceptable. Otherwise, we can't clear
>>>>> PG_mlocked before PG_lru in __munlock_folio().
>>>> What about following change to move mlocked operation before check lru
>>>> in __munlock_folio()?
>>>
>>> It seems correct to me on a high level, but I think there is a subtle problem:
>>>
>>> We clear PG_mlocked before trying to isolate to make sure that if
>>> someone already has the folio isolated they will put it back on an
>>> evictable list, then if we are able to isolate the folio ourselves and
>>> find that the mlock_count is > 0, we set PG_mlocked again.
>>>
>>> There is a small window where PG_mlocked might be temporarily cleared
>>> but the folio is not actually munlocked (i.e we don't update the
>>> NR_MLOCK stat). In that window, a racing reclaimer on a different cpu
>>> may find VM_LOCKED from in a different vma, and call mlock_folio(). In
>>> mlock_folio(), we will call folio_test_set_mlocked(folio) and see that
>>> PG_mlocked is clear, so we will increment the MLOCK stats, even though
>>> the folio was already mlocked. This can cause MLOCK stats to be
>>> unbalanced (increments more than decrements), no?
>> Looks like NR_MLOCK is always connected to PG_mlocked bit. Not possible
>> to be unbalanced.
>>
>> Let's say:
>> mlock_folio() NR_MLOCK increase and set mlocked
>> mlock_folio() NR_MLOCK NO change as folio is already mlocked
>>
>> __munlock_folio() with isolated folio. NR_MLOCK decrease (0) and
>> clear mlocked
>>
>> folio_putback_lru()
>> reclaimed mlock_folio() NR_MLOCK increase and set mlocked
>>
>> munlock_folio() NR_MLOCK decrease (0) and clear mlocked
>> munlock_folio() NR_MLOCK NO change as folio has no mlocked set
>
> Right. The problem with the diff is that we temporarily clear
> PG_mlocked *without* updating NR_MLOCK.
>
> Consider a folio that is mlocked by two vmas. NR_MLOCK = folio_nr_pages.
>
> Assume cpu 1 is doing __munlock_folio from one of the vmas, while cpu
> 2 is doing reclaim.
>
> cpu 1                           cpu 2
> clear PG_mlocked
>                                 folio_referenced()
>                                   mlock_folio()
>                                     set PG_mlocked
>                                     add to NR_MLOCK
> mlock_count > 0
>   set PG_mlocked
>   goto out
>
> Result: NR_MLOCK = folio_nr_pages * 2.
>
> When the folio is munlock()'d later from the second vma, NR_MLOCK will
> be reduced to folio_nr_pages, but there are not mlocked folios.
>
> This is the scenario that I have in mind. Please correct me if I am wrong.
Yes. It looks possible, even if it may be difficult to hit.

My first thought was that it's not possible because an unevictable folio
will not be picked by the reclaimer. But it becomes possible if things
happen between clearing PG_mlocked and the test-and-clear of PG_lru:
    folio_putback_lru() by another isolation user like migration
    reclaimer picks the folio up and calls mlock_folio()
    reclaimer call folio

The fix can be to follow the rule (keep NR_MLOCK coupled with the
PG_mlocked bit) strictly.


Regards
Yin, Fengwei

>
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>> index 0a0c996c5c21..514f0d5bfbfd 100644
>>>> --- a/mm/mlock.c
>>>> +++ b/mm/mlock.c
>>>> @@ -122,7 +122,9 @@ static struct lruvec *__mlock_new_folio(struct folio *folio, struct lruvec *lruv
>>>> static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec)
>>>> {
>>>> int nr_pages = folio_nr_pages(folio);
>>>> - bool isolated = false;
>>>> + bool isolated = false, mlocked = true;
>>>> +
>>>> + mlocked = folio_test_clear_mlocked(folio);
>>>>
>>>> if (!folio_test_clear_lru(folio))
>>>> goto munlock;
>>>> @@ -134,13 +136,17 @@ static struct lruvec *__munlock_folio(struct folio *folio, struct lruvec *lruvec
>>>> /* Then mlock_count is maintained, but might undercount */
>>>> if (folio->mlock_count)
>>>> folio->mlock_count--;
>>>> - if (folio->mlock_count)
>>>> + if (folio->mlock_count) {
>>>> + if (mlocked)
>>>> + folio_set_mlocked(folio);
>>>> goto out;
>>>> + }
>>>> }
>>>> /* else assume that was the last mlock: reclaim will fix it if not */
>>>>
>>>> munlock:
>>>> - if (folio_test_clear_mlocked(folio)) {
>>>> + if (mlocked) {
>>>> __zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages);
>>>> if (isolated || !folio_test_unevictable(folio))
>>>> __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
>>>>
>>>>
>>>>>
>>>>> I am not saying this is necessarily better than spinning, just a note
>>>>> (and perhaps selfishly making [1] more appealing ;)).
>>>>>
>>>>> [1]https://lore.kernel.org/lkml/[email protected]/
>>>>>
>>>>>>
>>>>>> Hugh

2023-07-26 13:47:40

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/15/23 14:06, Yu Zhao wrote:
> On Wed, Jul 12, 2023 at 12:31 AM Yu Zhao <[email protected]> wrote:
>>
>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>>
>>> Current kernel only lock base size folio during mlock syscall.
>>> Add large folio support with following rules:
>>> - Only mlock large folio when it's in VM_LOCKED VMA range
>>>
>>> - If there is cow folio, mlock the cow folio as cow folio
>>> is also in VM_LOCKED VMA range.
>>>
>>> - munlock will apply to the large folio which is in VMA range
>>> or cross the VMA boundary.
>>>
>>> The last rule is used to handle the case that the large folio is
>>> mlocked, later the VMA is split in the middle of large folio
>>> and this large folio become cross VMA boundary.
>>>
>>> Signed-off-by: Yin Fengwei <[email protected]>
>>> ---
>>> mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 99 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>> index 0a0c996c5c214..f49e079066870 100644
>>> --- a/mm/mlock.c
>>> +++ b/mm/mlock.c
>>> @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
>>> local_unlock(&mlock_fbatch.lock);
>>> }
>>>
>>> +static inline bool should_mlock_folio(struct folio *folio,
>>> + struct vm_area_struct *vma)
>>> +{
>>> + if (vma->vm_flags & VM_LOCKED)
>>> + return (!folio_test_large(folio) ||
>>> + folio_within_vma(folio, vma));
>>> +
>>> + /*
>>> + * For unlock, allow munlock large folio which is partially
>>> + * mapped to VMA. As it's possible that large folio is
>>> + * mlocked and VMA is split later.
>>> + *
>>> + * During memory pressure, such kind of large folio can
>>> + * be split. And the pages are not in VM_LOCKed VMA
>>> + * can be reclaimed.
>>> + */
>>> +
>>> + return true;
>>
>> Looks good, or just
>>
>> should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
>> {
>> return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
>> }
>>
>>> +}
>>> +
>>> +static inline unsigned int get_folio_mlock_step(struct folio *folio,
>>> + pte_t pte, unsigned long addr, unsigned long end)
>>> +{
>>> + unsigned int nr;
>>> +
>>> + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
>>> + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
>>> +}
>>> +
>>> +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
>>> + pte_t *pte, unsigned long addr, unsigned int nr)
>>> +{
>>> + struct folio *cow_folio;
>>> + unsigned int step = 1;
>>> +
>>> + mlock_folio(folio);
>>> + if (nr == 1)
>>> + return;
>>> +
>>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
>>> + pte_t ptent;
>>> +
>>> + step = 1;
>>> + ptent = ptep_get(pte);
>>> +
>>> + if (!pte_present(ptent))
>>> + continue;
>>> +
>>> + cow_folio = vm_normal_folio(vma, addr, ptent);
>>> + if (!cow_folio || cow_folio == folio) {
>>> + continue;
>>> + }
>>> +
>>> + mlock_folio(cow_folio);
>>> + step = get_folio_mlock_step(folio, ptent,
>>> + addr, addr + (nr << PAGE_SHIFT));
>>> + }
>>> +}
>>> +
>>> +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
>>> + pte_t *pte, unsigned long addr, unsigned int nr)
>>> +{
>>> + struct folio *cow_folio;
>>> + unsigned int step = 1;
>>> +
>>> + munlock_folio(folio);
>>> + if (nr == 1)
>>> + return;
>>> +
>>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
>>> + pte_t ptent;
>>> +
>>> + step = 1;
>>> + ptent = ptep_get(pte);
>>> +
>>> + if (!pte_present(ptent))
>>> + continue;
>>> +
>>> + cow_folio = vm_normal_folio(vma, addr, ptent);
>>> + if (!cow_folio || cow_folio == folio) {
>>> + continue;
>>> + }
>>> +
>>> + munlock_folio(cow_folio);
>>> + step = get_folio_mlock_step(folio, ptent,
>>> + addr, addr + (nr << PAGE_SHIFT));
>>> + }
>>> +}
>>
>> I'll finish the above later.
>
> There is a problem here that I didn't have the time to elaborate: we
> can't mlock() a folio that is within the range but not fully mapped
> because this folio can be on the deferred split queue. When the split
> happens, those unmapped folios (not mapped by this vma but are mapped
> into other vmas) will be stranded on the unevictable lru.
I checked the remap case in the past few days, and I agree we shouldn't
treat a folio that is in the range but not fully mapped as an in-range
folio.

As for the remap case, it's possible that the folio is not on the deferred
split queue, but part of the folio is mapped to a VM_LOCKED vma and the
other part is mapped to a non-VM_LOCKED vma. In this case, the folio can't
be split as it's not on the deferred split queue. So page reclaim should be
allowed to pick this folio up, split it and reclaim the pages in the
non-VM_LOCKED vma. So we can't mlock such a folio.

The same thing can happen with madvise_cold_or_pageout_pte_range().
I will update folio_in_vma() to check the PTEs as well.
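
Something along these lines, only as a sketch of the idea (the helper name
and details are made up here, not the actual change); it assumes pte points
at the entry for the folio's first page and addr is that page's mapping
address:

static bool folio_ptes_fully_mapped(struct folio *folio,
		struct vm_area_struct *vma, unsigned long addr, pte_t *pte)
{
	unsigned long i, nr = folio_nr_pages(folio);
	unsigned long pfn = folio_pfn(folio);

	/* The whole mapping range of the folio must sit inside the VMA. */
	if (addr < vma->vm_start || addr + (nr << PAGE_SHIFT) > vma->vm_end)
		return false;

	/* Every PTE must be present and map the expected page of the folio. */
	for (i = 0; i < nr; i++) {
		pte_t ptent = ptep_get(pte + i);

		if (!pte_present(ptent) || pte_pfn(ptent) != pfn + i)
			return false;
	}

	return true;
}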

Regards
Yin, Fengwei

>
> For that matter, we can't mlock any large folios that are being
> shared, unless you want to overengineer it by checking whether all
> sharing vmas are also mlocked -- mlock is cleared during fork. So the
> condition for mlocking large anon folios is 1) within range 2) fully
> mapped 3) not shared (mapcount is 1). The final patch should look like
> something like this:
>
> - if (folio_test_large(folio))
> + if (folio_pfn(folio) != pte_pfn(ptent))
> + continue;
> + if (!the_aforementioned_condition())
>
> There is another corner case I forgot to mention: for example, what if
> a folio spans two (the only two) adjacent mlocked vmas? No need to
> worry about this since it's not worth optimizing.

2023-07-26 18:23:47

by Yu Zhao

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

On Wed, Jul 26, 2023 at 6:49 AM Yin Fengwei <[email protected]> wrote:
>
>
>
> On 7/15/23 14:06, Yu Zhao wrote:
> > On Wed, Jul 12, 2023 at 12:31 AM Yu Zhao <[email protected]> wrote:
> >>
> >> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
> >>>
> >>> Current kernel only lock base size folio during mlock syscall.
> >>> Add large folio support with following rules:
> >>> - Only mlock large folio when it's in VM_LOCKED VMA range
> >>>
> >>> - If there is cow folio, mlock the cow folio as cow folio
> >>> is also in VM_LOCKED VMA range.
> >>>
> >>> - munlock will apply to the large folio which is in VMA range
> >>> or cross the VMA boundary.
> >>>
> >>> The last rule is used to handle the case that the large folio is
> >>> mlocked, later the VMA is split in the middle of large folio
> >>> and this large folio become cross VMA boundary.
> >>>
> >>> Signed-off-by: Yin Fengwei <[email protected]>
> >>> ---
> >>> mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>> 1 file changed, 99 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/mm/mlock.c b/mm/mlock.c
> >>> index 0a0c996c5c214..f49e079066870 100644
> >>> --- a/mm/mlock.c
> >>> +++ b/mm/mlock.c
> >>> @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
> >>> local_unlock(&mlock_fbatch.lock);
> >>> }
> >>>
> >>> +static inline bool should_mlock_folio(struct folio *folio,
> >>> + struct vm_area_struct *vma)
> >>> +{
> >>> + if (vma->vm_flags & VM_LOCKED)
> >>> + return (!folio_test_large(folio) ||
> >>> + folio_within_vma(folio, vma));
> >>> +
> >>> + /*
> >>> + * For unlock, allow munlock large folio which is partially
> >>> + * mapped to VMA. As it's possible that large folio is
> >>> + * mlocked and VMA is split later.
> >>> + *
> >>> + * During memory pressure, such kind of large folio can
> >>> + * be split. And the pages are not in VM_LOCKed VMA
> >>> + * can be reclaimed.
> >>> + */
> >>> +
> >>> + return true;
> >>
> >> Looks good, or just
> >>
> >> should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
> >> {
> >> return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
> >> }
> >>
> >>> +}
> >>> +
> >>> +static inline unsigned int get_folio_mlock_step(struct folio *folio,
> >>> + pte_t pte, unsigned long addr, unsigned long end)
> >>> +{
> >>> + unsigned int nr;
> >>> +
> >>> + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
> >>> + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
> >>> +}
> >>> +
> >>> +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> >>> + pte_t *pte, unsigned long addr, unsigned int nr)
> >>> +{
> >>> + struct folio *cow_folio;
> >>> + unsigned int step = 1;
> >>> +
> >>> + mlock_folio(folio);
> >>> + if (nr == 1)
> >>> + return;
> >>> +
> >>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> >>> + pte_t ptent;
> >>> +
> >>> + step = 1;
> >>> + ptent = ptep_get(pte);
> >>> +
> >>> + if (!pte_present(ptent))
> >>> + continue;
> >>> +
> >>> + cow_folio = vm_normal_folio(vma, addr, ptent);
> >>> + if (!cow_folio || cow_folio == folio) {
> >>> + continue;
> >>> + }
> >>> +
> >>> + mlock_folio(cow_folio);
> >>> + step = get_folio_mlock_step(folio, ptent,
> >>> + addr, addr + (nr << PAGE_SHIFT));
> >>> + }
> >>> +}
> >>> +
> >>> +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> >>> + pte_t *pte, unsigned long addr, unsigned int nr)
> >>> +{
> >>> + struct folio *cow_folio;
> >>> + unsigned int step = 1;
> >>> +
> >>> + munlock_folio(folio);
> >>> + if (nr == 1)
> >>> + return;
> >>> +
> >>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> >>> + pte_t ptent;
> >>> +
> >>> + step = 1;
> >>> + ptent = ptep_get(pte);
> >>> +
> >>> + if (!pte_present(ptent))
> >>> + continue;
> >>> +
> >>> + cow_folio = vm_normal_folio(vma, addr, ptent);
> >>> + if (!cow_folio || cow_folio == folio) {
> >>> + continue;
> >>> + }
> >>> +
> >>> + munlock_folio(cow_folio);
> >>> + step = get_folio_mlock_step(folio, ptent,
> >>> + addr, addr + (nr << PAGE_SHIFT));
> >>> + }
> >>> +}
> >>
> >> I'll finish the above later.
> >
> > There is a problem here that I didn't have the time to elaborate: we
> > can't mlock() a folio that is within the range but not fully mapped
> > because this folio can be on the deferred split queue. When the split
> > happens, those unmapped folios (not mapped by this vma but are mapped
> > into other vmas) will be stranded on the unevictable lru.
> Checked remap case in past few days, I agree we shouldn't treat a folio
> in the range but not fully mapped as in_range folio.
>
> As for remap case, it's possible that the folio is not in deferred split
> queue. But part of folio is mapped to VM_LOCKED vma and other part of
> folio is mapped to none VM_LOCKED vma. In this case, page can't be split
> as it's not in deferred split queue. So page reclaim should be allowed to
> pick this folio up, split it and reclaim the pages in none VM_LOCKED vma.
> So we can't mlock such kind of folio.
>
> The same thing can happen with madvise_cold_or_pageout_pte_range().
> I will update folio_in_vma() to check the PTE also.

Thanks, and I think we should move forward with this series and fix
the potential mlock race problem separately since it's not caused by
this series.

WDYT?

2023-07-27 00:36:57

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio



On 7/27/23 00:57, Yu Zhao wrote:
> On Wed, Jul 26, 2023 at 6:49 AM Yin Fengwei <[email protected]> wrote:
>>
>>
>>
>> On 7/15/23 14:06, Yu Zhao wrote:
>>> On Wed, Jul 12, 2023 at 12:31 AM Yu Zhao <[email protected]> wrote:
>>>>
>>>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <[email protected]> wrote:
>>>>>
>>>>> Current kernel only lock base size folio during mlock syscall.
>>>>> Add large folio support with following rules:
>>>>> - Only mlock large folio when it's in VM_LOCKED VMA range
>>>>>
>>>>> - If there is cow folio, mlock the cow folio as cow folio
>>>>> is also in VM_LOCKED VMA range.
>>>>>
>>>>> - munlock will apply to the large folio which is in VMA range
>>>>> or cross the VMA boundary.
>>>>>
>>>>> The last rule is used to handle the case that the large folio is
>>>>> mlocked, later the VMA is split in the middle of large folio
>>>>> and this large folio become cross VMA boundary.
>>>>>
>>>>> Signed-off-by: Yin Fengwei <[email protected]>
>>>>> ---
>>>>> mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>> 1 file changed, 99 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>>> index 0a0c996c5c214..f49e079066870 100644
>>>>> --- a/mm/mlock.c
>>>>> +++ b/mm/mlock.c
>>>>> @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
>>>>> local_unlock(&mlock_fbatch.lock);
>>>>> }
>>>>>
>>>>> +static inline bool should_mlock_folio(struct folio *folio,
>>>>> + struct vm_area_struct *vma)
>>>>> +{
>>>>> + if (vma->vm_flags & VM_LOCKED)
>>>>> + return (!folio_test_large(folio) ||
>>>>> + folio_within_vma(folio, vma));
>>>>> +
>>>>> + /*
>>>>> + * For unlock, allow munlock large folio which is partially
>>>>> + * mapped to VMA. As it's possible that large folio is
>>>>> + * mlocked and VMA is split later.
>>>>> + *
>>>>> + * During memory pressure, such kind of large folio can
>>>>> + * be split. And the pages are not in VM_LOCKed VMA
>>>>> + * can be reclaimed.
>>>>> + */
>>>>> +
>>>>> + return true;
>>>>
>>>> Looks good, or just
>>>>
>>>> should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
>>>> {
>>>> return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
>>>> }
>>>>
>>>>> +}
>>>>> +
>>>>> +static inline unsigned int get_folio_mlock_step(struct folio *folio,
>>>>> + pte_t pte, unsigned long addr, unsigned long end)
>>>>> +{
>>>>> + unsigned int nr;
>>>>> +
>>>>> + nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
>>>>> + return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
>>>>> +}
>>>>> +
>>>>> +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
>>>>> + pte_t *pte, unsigned long addr, unsigned int nr)
>>>>> +{
>>>>> + struct folio *cow_folio;
>>>>> + unsigned int step = 1;
>>>>> +
>>>>> + mlock_folio(folio);
>>>>> + if (nr == 1)
>>>>> + return;
>>>>> +
>>>>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
>>>>> + pte_t ptent;
>>>>> +
>>>>> + step = 1;
>>>>> + ptent = ptep_get(pte);
>>>>> +
>>>>> + if (!pte_present(ptent))
>>>>> + continue;
>>>>> +
>>>>> + cow_folio = vm_normal_folio(vma, addr, ptent);
>>>>> + if (!cow_folio || cow_folio == folio) {
>>>>> + continue;
>>>>> + }
>>>>> +
>>>>> + mlock_folio(cow_folio);
>>>>> + step = get_folio_mlock_step(folio, ptent,
>>>>> + addr, addr + (nr << PAGE_SHIFT));
>>>>> + }
>>>>> +}
>>>>> +
>>>>> +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
>>>>> + pte_t *pte, unsigned long addr, unsigned int nr)
>>>>> +{
>>>>> + struct folio *cow_folio;
>>>>> + unsigned int step = 1;
>>>>> +
>>>>> + munlock_folio(folio);
>>>>> + if (nr == 1)
>>>>> + return;
>>>>> +
>>>>> + for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
>>>>> + pte_t ptent;
>>>>> +
>>>>> + step = 1;
>>>>> + ptent = ptep_get(pte);
>>>>> +
>>>>> + if (!pte_present(ptent))
>>>>> + continue;
>>>>> +
>>>>> + cow_folio = vm_normal_folio(vma, addr, ptent);
>>>>> + if (!cow_folio || cow_folio == folio) {
>>>>> + continue;
>>>>> + }
>>>>> +
>>>>> + munlock_folio(cow_folio);
>>>>> + step = get_folio_mlock_step(folio, ptent,
>>>>> + addr, addr + (nr << PAGE_SHIFT));
>>>>> + }
>>>>> +}
>>>>
>>>> I'll finish the above later.
>>>
>>> There is a problem here that I didn't have the time to elaborate: we
>>> can't mlock() a folio that is within the range but not fully mapped
>>> because this folio can be on the deferred split queue. When the split
>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>> into other vmas) will be stranded on the unevictable lru.
>> Checked remap case in past few days, I agree we shouldn't treat a folio
>> in the range but not fully mapped as in_range folio.
>>
>> As for remap case, it's possible that the folio is not in deferred split
>> queue. But part of folio is mapped to VM_LOCKED vma and other part of
>> folio is mapped to none VM_LOCKED vma. In this case, page can't be split
>> as it's not in deferred split queue. So page reclaim should be allowed to
>> pick this folio up, split it and reclaim the pages in none VM_LOCKED vma.
>> So we can't mlock such kind of folio.
>>
>> The same thing can happen with madvise_cold_or_pageout_pte_range().
>> I will update folio_in_vma() to check the PTE also.
>
> Thanks, and I think we should move forward with this series and fix
> the potential mlock race problem separately since it's not caused by
> this series.
>
> WDYT?

Yes. Agree. Will send v3 with remap case covered.


Regards
Yin, Fengwei