2024-03-18 09:33:55

by zhaoyang.huang

Subject: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

>On Mon, Mar 18, 2024 at 2:15 PM Zhaoyang Huang
><[email protected]> wrote:
>>
>> On Mon, Mar 18, 2024 at 11:28 AM Matthew Wilcox <[email protected]>
>wrote:
>> >
>> > On Mon, Mar 18, 2024 at 01:37:04AM +0000, 黄朝阳 (Zhaoyang Huang)
>wrote:
>> > > >On Sun, Mar 17, 2024 at 12:07:40PM +0800, Zhaoyang Huang wrote:
>> > > >> Could it be this scenario, where folio comes from pte(thread
>> > > >> 0), local fbatch(thread 1) and page cache(thread 2)
>> > > >> concurrently and proceed intermixed without lock's protection?
>> > > >> Actually, IMO, thread 1 also could see the folio with refcnt==1
>> > > >> since it doesn't care if the page is on the page cache or not.
>> > > >>
>> > > >> madivise_cold_and_pageout does no explicit folio_get thing
>> > > >> since the folio comes from pte which implies it has one refcnt
>> > > >> from pagecache
>> > > >
>> > > >Mmm, no. It's implicit, but madvise_cold_or_pageout_pte_range()
>> > > >does guarantee that the folio has at least one refcount.
>> > > >
>> > > >Since we get the folio from vm_normal_folio(vma, addr, ptent); we
>> > > >know that there is at least one mapcount on the folio. refcount is
>always >= mapcount.
>> > > >Since we hold pte_offset_map_lock(), we know that mapcount (and
>> > > >therefore
>> > > >refcount) cannot be decremented until we call pte_unmap_unlock(),
>> > > >which we don't do until we have called folio_isolate_lru().
>> > > >
>> > > >Good try though, took me a few minutes of looking at it to
>> > > >convince myself that it was safe.
>> > > >
>> > > >Something to bear in mind is that if the race you outline is
>> > > >real, failing to hold a refcount on the folio leaves the caller
>> > > >susceptible to the VM_BUG_ON_FOLIO(!folio_ref_count(folio),
>> > > >folio); if the other thread calls folio_put().
>> > > Resend the chart via outlook.
>> > > I think the problem relies on a special timing which is rare; I would like to
>> > > list the steps below in timing sequence.
>> > >
>> > > 1. thread 0 calls folio_isolate_lru with refcnt == 1
>> >
>> > (i assume you mean refcnt == 2 here, otherwise none of this makes
>> > sense)
>> >
>> > > 2. thread 1 calls release_pages with refcnt == 2. (IMO, it could be
>> > >    1 as release_pages doesn't care if the folio is used by page cache
>> > >    or fs)
>> > > 3. thread 2 decreases refcnt to 1 by calling filemap_free_folio. (as I
>> > >    mentioned in 2, thread 2 is not mandatory here)
>> > > 4. thread 1 calls folio_put_testzero and passes. (lruvec->lock has not
>> > >    been taken here)
>> >
>> > But there's already a bug here.
>> >
>> > Rearrange the order of this:
>> >
>> > 2. thread 1 calls release_pages with refcount == 2 (decreasing
>> >    refcount to 1)
>> > 3. thread 2 decreases refcount to 0 by calling filemap_free_folio
>> > 1. thread 0 calls folio_isolate_lru() and hits the BUG().
>> >
>> > > 5. thread 0 clears folio's PG_lru by calling folio_test_clear_lru. The
>> > >    folio_get behind it has no meaning there.
>> > > 6. thread 1 fails in folio_test_lru and leaves the folio on the LRU.
>> > > 7. thread 1 adds the folio to pages_to_free wrongly, which could break
>> > >    the LRU list and will have the next folio experience
>> > >    list_del_invalid
>> > >
>> > > #thread 0(madivise_cold_and_pageout)
>#1(lru_add_drain->fbatch_release_pages)
>#2(read_pages->filemap_remove_folios)
>> > > refcnt == 1(represent page cache) refcnt==2(another one
>represent LRU) folio comes from page cache
>> >
>> > This is still illegible. Try it this way:
>> >
>> > Thread 0                              Thread 1                      Thread 2
>> > madvise_cold_or_pageout_pte_range
>> >                                       lru_add_drain
>> >                                       fbatch_release_pages
>> >                                                                     read_pages
>> >                                                                     filemap_remove_folio
>> Thread 0                              Thread 1                        Thread 2
>> madvise_cold_or_pageout_pte_range
>>                                       truncate_inode_pages_range
>>                                       fbatch_release_pages
>>                                                                       truncate_inode_pages_range
>>                                                                       filemap_remove_folio
>>
>> Sorry for the confusion. I rearranged the timing chart as above according
>> to the real panic's stacktrace. Threads 1 & 2 are both from
>> truncate_inode_pages_range (I think thread 2 (read_pages) is not
>> mandatory here as threads 0 & 1 could rely on the same refcnt == 1).
>> >
>> > Some accuracy in your report would also be appreciated. There's no
>> > function called madivise_cold_and_pageout, nor is there a function
>> > called filemap_remove_folios(). It's a little detail, but it's
>> > annoying for me to try to find which function you're actually
>> > referring to. I have to guess, and it puts me in a bad mood.
>> >
>> > At any rate, these three functions cannot do what you're proposing.
>> > In read_pages(), when we call filemap_remove_folio(), the folio in
>> > question will not have the uptodate flag set, so can never have been
>> > put in the page tables, so cannot be found by madvise().
>> >
>> > Also, as I said in my earlier email,
>> > madvise_cold_or_pageout_pte_range()
>> > does guarantee that the refcount on the folio is held and can never
>> > decrease to zero while folio_isolate_lru() is running. So that's
>> > two ways this scenario cannot happen.
>> The madvise_xxx scenario comes from my presumption, which has no proof.
>> Whereas, it looks like truncate_inode_pages_range just cares about the
>> page cache refcnt via folio_put_testzero without noticing any task's VM
>> state. Furthermore, I notice that move_folios_to_lru is safe as it
>> runs while holding lruvec->lock.
>> >
>BTW, I think we need to protect all
>folio_test_clear_lru/folio_test_lru calls by moving them under lruvec->lock in
>functions such as __page_cache_release and folio_activate.
>Otherwise, there is always a race window between testing PG_lru and the
>following actions.

Summarizing all information below to make it clearer (thread 2 has been removed since it is not mandatory and only makes the scenario more complex).

#thread 0(madivise_cold_and_pageout) #thread1(truncate_inode_pages_range)
pte_offset_map_lock takes NO lock
truncate_inode_folio(refcnt == 2)
<decrease the refcnt of page cache>
folio_isolate_lru(refcnt == 1)
release_pages(refcnt == 1)
folio_test_clear_lru
<remove folio's PG_lru>
folio_put_testzero == true
folio_get(refer to isolation)
folio_test_lru == false
<No lruvec_del_folio>
list_add(folio->lru, pages_to_free)
****current folio will break LRU's integrity since it has not been deleted****

0. Folio's refcnt decreases from 2 to 1 via filemap_remove_folio
1. thread 0 calls folio_isolate_lru with refcnt == 1. Folio comes from vm's pte
2. thread 1 calls release_pages with refcnt == 1. Folio comes from address_space
(refcnt == 1 makes sense for both folio_isolate_lru and release_pages)
3. thread 0 clears folio's PG_lru by folio_test_clear_lru
4. thread 1 decreases folio's refcnt from 1 to 0 and gets permission to proceed
5. thread 1 fails in folio_test_lru and does no list_del(folio)
6. thread 1 adds the folio to pages_to_free wrongly, which breaks the LRU list
7. the next folio after the current one within thread 1 experiences list_del_invalid when calling lruvec_del_folio
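
For reference, the isolation path these steps go through is roughly the following (an abridged sketch of folio_isolate_lru() as I read mm/vmscan.c; stats accounting elided):

	/* sketch only -- not the verbatim kernel source */
	bool folio_isolate_lru(struct folio *folio)
	{
		bool ret = false;

		VM_BUG_ON_FOLIO(!folio_ref_count(folio), folio);

		if (folio_test_clear_lru(folio)) {
			struct lruvec *lruvec;

			/* the extra reference pins the folio for the isolator */
			folio_get(folio);
			lruvec = folio_lruvec_lock_irq(folio);
			lruvec_del_folio(lruvec, folio);
			unlock_page_lruvec_irq(lruvec);
			ret = true;
		}

		return ret;
	}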


2024-03-18 12:32:48

by Matthew Wilcox

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru


Stop creating new threads. You're really annoying.

On Mon, Mar 18, 2024 at 09:32:32AM +0000, 黄朝阳 (Zhaoyang Huang) wrote:
> Summarize all information below to make it more clear(remove thread2 which is not mandatory and make the scenario complex)

You've gone back to over-indenting. STOP IT.

> #thread 0(madivise_cold_and_pageout) #thread1(truncate_inode_pages_range)

This is still an impossible race, and it's the third time I've told you
this. And madivise_cold_and_pageout does not exist, it's
madvise_cold_or_pageout_pte_range(). I'm going to stop responding to
your emails if you keep on uselessly repeating the same mistakes.

So, once again,

For madvise_cold_or_pageout_pte_range() to find a page, it must have
a PTE pointing to the page. That means there's a mapcount on the page.
That means there's a refcount on the page.
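
Concretely, the shape of that loop is roughly as follows (a heavily abridged sketch of madvise_cold_or_pageout_pte_range() from my reading of mm/madvise.c; the THP, swap-entry and deactivation paths are omitted):

	start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
	for (; addr < end; pte++, addr += PAGE_SIZE) {
		ptent = ptep_get(pte);
		if (pte_none(ptent) || !pte_present(ptent))
			continue;
		folio = vm_normal_folio(vma, addr, ptent);
		if (!folio || folio_is_zone_device(folio))
			continue;
		...
		if (pageout) {
			/* mapcount (and so refcount) is pinned while we hold the PTE lock */
			if (folio_isolate_lru(folio))
				list_add(&folio->lru, &folio_list);
		}
	}
	pte_unmap_unlock(start_pte, ptl);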

truncate_inode_pages_range() will indeed attempt to remove a page from
the page cache. BUT before it does that, it has to shoot down TLB
entries that refer to the affected folios. That happens like this:

	for (i = 0; i < folio_batch_count(&fbatch); i++)
		truncate_cleanup_folio(fbatch.folios[i]);
truncate_cleanup_folio() -> unmap_mapping_folio ->
unmap_mapping_range_tree() -> unmap_mapping_range_vma() ->
zap_page_range_single() -> unmap_single_vma -> unmap_page_range ->
zap_p4d_range -> zap_pud_range -> zap_pmd_range -> zap_pte_range ->
pte_offset_map_lock()
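
truncate_cleanup_folio() itself is short; abridged, as I read mm/truncate.c (dirty and private-data handling elided):

	static void truncate_cleanup_folio(struct folio *folio)
	{
		if (folio_mapped(folio))
			unmap_mapping_folio(folio);	/* zaps every PTE, taking each PTE lock, per the chain above */
		...
		folio_cancel_dirty(folio);
		folio_clear_mappedtodisk(folio);
	}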

> pte_offset_map_lock takes NO lock
> truncate_inode_folio(refcnt == 2)
> <decrease the refcnt of page cache>
> folio_isolate_lru(refcnt == 1)
> release_pages(refcnt == 1)
> folio_test_clear_lru
> <remove folio's PG_lru>
> folio_put_testzero == true
> folio_get(refer to isolation)
> folio_test_lru == false
> <No lruvec_del_folio>
> list_add(folio->lru, pages_to_free)
> ****current folio will break LRU's integrity since it has not been deleted****
>
> 0. Folio's refcnt decrease from 2 to 1 by filemap_remove_folio
> 1. thread 0 calls folio_isolate_lru with refcnt == 1. Folio comes from vm's pte
> 2. thread 1 calls release_pages with refcnt == 1. Folio comes from address_space
> (refcnt == 1 make sense for both of folio_isolate_lru and release_pages)
> 3. thread0 clear folio's PG_lru by folio_test_clear_lru
> 4. thread1 decrease folio's refcnt from 1 to 0 and get permission to proceed
> 5. thread1 failed in folio_test_lru and do no list_del(folio)
> 6. thread1 add folio to pages_to_free wrongly which break the LRU's->list
> 7. next folio after current one within thread1 experiences list_del_invalid when calling lruvec_del_folio

2024-03-19 00:49:09

by Zhaoyang Huang

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Mon, Mar 18, 2024 at 8:32 PM Matthew Wilcox <[email protected]> wrote:
>
>
> Stop creating new threads. You're really annoying.
>
> On Mon, Mar 18, 2024 at 09:32:32AM +0000, 黄朝阳 (Zhaoyang Huang) wrote:
> > Summarize all information below to make it more clear(remove thread2 which is not mandatory and make the scenario complex)
>
> You've gone back to over-indenting. STOP IT.
>
> > #thread 0(madivise_cold_and_pageout) #thread1(truncate_inode_pages_range)
>
> This is still an impossible race, and it's the third time I've told you
> this. And madivise_cold_and_pageout does not exist, it's
> madvise_cold_or_pageout_pte_range(). I'm going to stop responding to
> your emails if you keep on uselessly repeating the same mistakes.
>
> So, once again,
>
> For madvise_cold_or_pageout_pte_range() to find a page, it must have
> a PTE pointing to the page. That means there's a mapcount on the page.
> That means there's a refcount on the page.
>
> truncate_inode_pages_range() will indeed attempt to remove a page from
> the page cache. BUT before it does that, it has to shoot down TLB
> entries that refer to the affected folios. That happens like this:
>
> for (i = 0; i < folio_batch_count(&fbatch); i++)
> truncate_cleanup_folio(fbatch.folios[i]);
> truncate_cleanup_folio() -> unmap_mapping_folio ->
> unmap_mapping_range_tree() -> unmap_mapping_range_vma() ->
> zap_page_range_single() -> unmap_single_vma -> unmap_page_range ->
> zap_p4d_range -> zap_pud_range -> zap_pmd_range -> zap_pte_range ->
> pte_offset_map_lock()
Sorry, and thanks for the reminder. I wonder if it is possible that
madvise_cold_or_pageout_pte_range joins these races before
truncate_inode_pages_range finishes doing the pte cleanup via
truncate_cleanup_folio, which seems like it could still make the race
timing below make sense. BTW, damon_pa_pageout is a potential risk over this
race

>
> > pte_offset_map_lock takes NO lock
> > truncate_inode_folio(refcnt == 2)
> > <decrease the refcnt of page cache>
> > folio_isolate_lru(refcnt == 1)
> > release_pages(refcnt == 1)
> > folio_test_clear_lru
> > <remove folio's PG_lru>
> > folio_put_testzero == true
> > folio_get(refer to isolation)
> > folio_test_lru == false
> > <No lruvec_del_folio>
> > list_add(folio->lru, pages_to_free)
> > ****current folio will break LRU's integrity since it has not been deleted****
> >
> > 0. Folio's refcnt decrease from 2 to 1 by filemap_remove_folio
> > 1. thread 0 calls folio_isolate_lru with refcnt == 1. Folio comes from vm's pte
> > 2. thread 1 calls release_pages with refcnt == 1. Folio comes from address_space
> > (refcnt == 1 make sense for both of folio_isolate_lru and release_pages)
> > 3. thread0 clear folio's PG_lru by folio_test_clear_lru
> > 4. thread1 decrease folio's refcnt from 1 to 0 and get permission to proceed
> > 5. thread1 failed in folio_test_lru and do no list_del(folio)
> > 6. thread1 add folio to pages_to_free wrongly which break the LRU's->list
> > 7. next folio after current one within thread1 experiences list_del_invalid when calling lruvec_del_folio

2024-03-19 03:01:33

by Matthew Wilcox

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Tue, Mar 19, 2024 at 08:48:42AM +0800, Zhaoyang Huang wrote:
> BTW, damon_pa_pageout is a potential risk over this race

No it isn't.

	struct folio *folio = damon_get_folio(PHYS_PFN(addr));

	if (!folio)
		continue;

	if (damos_pa_filter_out(s, folio))
		goto put_folio;

	folio_clear_referenced(folio);
	folio_test_clear_young(folio);
	if (!folio_isolate_lru(folio))
		goto put_folio;
	if (folio_test_unevictable(folio))
		folio_putback_lru(folio);
	else
		list_add(&folio->lru, &folio_list);
put_folio:
	folio_put(folio);

It clearly has a folio reference when it calls folio_isolate_lru().

2024-03-21 08:25:30

by Zhaoyang Huang

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Tue, Mar 19, 2024 at 11:01 AM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Mar 19, 2024 at 08:48:42AM +0800, Zhaoyang Huang wrote:
> > BTW, damon_pa_pageout is a potential risk over this race
>
> No it isn't.
>
> struct folio *folio = damon_get_folio(PHYS_PFN(addr));
>
> if (!folio)
> continue;
>
> if (damos_pa_filter_out(s, folio))
> goto put_folio;
>
> folio_clear_referenced(folio);
> folio_test_clear_young(folio);
> if (!folio_isolate_lru(folio))
> goto put_folio;
> if (folio_test_unevictable(folio))
> folio_putback_lru(folio);
> else
> list_add(&folio->lru, &folio_list);
> put_folio:
> folio_put(folio);
>
> It clearly has a folio reference when it calls folio_isolate_lru().

ok. Could the scenario below be suspicious on leaving an orphan folio
in step 7 and introduce the bug in step 8. In the scenario,
Thread_filemap behaves as a backdoor for Thread_madv by creating the
pte after Thread_truncate finishes cleaning all page tables.

0. Thread_bad gets the folio by folio_get_entry and stores it in its
local fbatch_bad and go to sleep

1. Thread_filemap get the folio via
filemap_map_pages->next_uptodate_folio->xas_next_entry and gets
preempted
refcnt == 1(page_cache), PG_lru == true

2. Thread_truncate get the folio via
truncate_inode_pages_range->find_lock_entries
refcnt == 2(fbatch_trunc, page_cache), PG_lru == true

3. Thread_truncate proceed to truncate_cleanup_folio
refcnt == 2(fbatch_trunc, page_cache), PG_lru == true

4. Thread_truncate proceed to delete_from_page_cache_batch
refcnt == 1(fbatch_trunc), PG_lru == true

5. Thread_filemap schedule back and proceed to setup a pte and have
folio->_mapcnt = 0 & folio->refcnt += 1
refcnt == 2(pte, fbatch_temp), PG_lru == true

6. Thread_madv clear folio's PG_lru by
madvise_xxx_pte_range->folio_isolate_lru->folio_test_clear_lru
refcnt == 2(pte,fbatch_temp), PG_lru == false

7. Thread_truncate call folio_fbatch_release and failed in freeing
folio as refcnt not reach 0
refcnt == 1(pte), PG_lru == false
********folio becomes an orphan here which is not on the page cache
but on the task's VM**********

8. Thread_xxx scheduled back from 0 to do release_pages(fbatch_bad)
and have the folio introduce the bug.

2024-03-21 12:36:26

by Matthew Wilcox

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Thu, Mar 21, 2024 at 04:25:07PM +0800, Zhaoyang Huang wrote:
> ok. Could the scenario below be suspicious on leaving an orphan folio
> in step 7 and introduce the bug in step 8. In the scenario,
> Thread_filemap behaves as a backdoor for Thread_madv by creating the
> pte after Thread_truncate finishes cleaning all page tables.
>
> 0. Thread_bad gets the folio by folio_get_entry and stores it in its
> local fbatch_bad and go to sleep

There's no function called folio_get_entry(), but clearly thread_bad
should have a refcount on it at this point.

> 1. Thread_filemap get the folio via
> filemap_map_pages->next_uptodate_folio->xas_next_entry and gets
> preempted
> refcnt == 1(page_cache), PG_lru == true

so the refcount should be 2 here.

> 2. Thread_truncate get the folio via
> truncate_inode_pages_range->find_lock_entries
> refcnt == 2(fbatch_trunc, page_cache), PG_lru == true
>
> 3. Thread_truncate proceed to truncate_cleanup_folio
> refcnt == 2(fbatch_trunc, page_cache), PG_lru == true
>
> 4. Thread_truncate proceed to delete_from_page_cache_batch
> refcnt == 1(fbatch_trunc), PG_lru == true
>
> 5. Thread_filemap schedule back and proceed to setup a pte and have
> folio->_mapcnt = 0 & folio->refcnt += 1
> refcnt == 2(pte, fbatch_temp), PG_lru == true
>
> 6. Thread_madv clear folio's PG_lru by
> madvise_xxx_pte_range->folio_isolate_lru->folio_test_clear_lru
> refcnt == 2(pte,fbatch_temp), PG_lru == false
>
> 7. Thread_truncate call folio_fbatch_release and failed in freeing
> folio as refcnt not reach 0
> refcnt == 1(pte), PG_lru == false
> ********folio becomes an orphan here which is not on the page cache
> but on the task's VM**********
>
> 8. Thread_xxx scheduled back from 0 to do release_pages(fbatch_bad)
> and have the folio introduce the bug.

... because if these steps happen as 7, 8, 6, you hit the BUG in
folio_isolate_lru().

2024-03-22 01:53:05

by Zhaoyang Huang

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Thu, Mar 21, 2024 at 8:36 PM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Mar 21, 2024 at 04:25:07PM +0800, Zhaoyang Huang wrote:
> > ok. Could the scenario below be suspicious on leaving an orphan folio
> > in step 7 and introduce the bug in step 8. In the scenario,
> > Thread_filemap behaves as a backdoor for Thread_madv by creating the
> > pte after Thread_truncate finishes cleaning all page tables.
> >
> > 0. Thread_bad gets the folio by folio_get_entry and stores it in its
> > local fbatch_bad and go to sleep
>
> There's no function called folio_get_entry(), but clearly thread_bad
> should have a refcount on it at this point.
>
> > 1. Thread_filemap get the folio via
> > filemap_map_pages->next_uptodate_folio->xas_next_entry and gets
> > preempted
> > refcnt == 1(page_cache), PG_lru == true
>
> so the refcount should be 2 here.
>
> > 2. Thread_truncate get the folio via
> > truncate_inode_pages_range->find_lock_entries
> > refcnt == 2(fbatch_trunc, page_cache), PG_lru == true
> >
> > 3. Thread_truncate proceed to truncate_cleanup_folio
> > refcnt == 2(fbatch_trunc, page_cache), PG_lru == true
> >
> > 4. Thread_truncate proceed to delete_from_page_cache_batch
> > refcnt == 1(fbatch_trunc), PG_lru == true
> >
> > 5. Thread_filemap schedule back and proceed to setup a pte and have
> > folio->_mapcnt = 0 & folio->refcnt += 1
> > refcnt == 2(pte, fbatch_temp), PG_lru == true
> >
> > 6. Thread_madv clear folio's PG_lru by
> > madvise_xxx_pte_range->folio_isolate_lru->folio_test_clear_lru
> > refcnt == 2(pte,fbatch_temp), PG_lru == false
> >
> > 7. Thread_truncate call folio_fbatch_release and failed in freeing
> > folio as refcnt not reach 0
> > refcnt == 1(pte), PG_lru == false
> > ********folio becomes an orphan here which is not on the page cache
> > but on the task's VM**********
> >
> > 8. Thread_xxx scheduled back from 0 to do release_pages(fbatch_bad)
> > and have the folio introduce the bug.
>
> ... because if these steps happen as 7, 8, 6, you hit the BUG in
> folio_isolate_lru().
Thanks for the comments. fix the typo and update the timing sequence
by amending possible preempt points to have the refcnt make sense.

0. Thread_bad gets the folio by find_get_entry and preempted before
take refcnt(could be the second round scan of
truncate_inode_pages_range)
refcnt == 1(page_cache), PG_lru == true, PG_lock == false
find_get_entry
folio = xas_find
<preempted>
folio_try_get_rcu

1. Thread_filemap get the folio via
filemap_map_pages->next_uptodate_folio->xas_next_entry and gets preempted
refcnt == 1(page_cache), PG_lru == true, PG_lock == false
filemap_map_pages
next_uptodate_folio
xas_next_entry
<preempted>
folio_try_get_rcu

2. Thread_truncate get the folio via
truncate_inode_pages_range->find_lock_entries
refcnt == 2(page_cache, fbatch_truncate), PG_lru == true, PG_lock == true

3. Thread_truncate proceed to truncate_cleanup_folio
refcnt == 2(page_cache, fbatch_truncate), PG_lru == true, PG_lock == true

4. Thread_truncate proceed to delete_from_page_cache_batch
refcnt == 1(fbatch_truncate), PG_lru == true, PG_lock == true

4.1 folio_unlock
refcnt == 1(fbatch_truncate), PG_lru == true, PG_lock == false

5. Thread_filemap schedule back from '1' and proceed to setup a pte
and have folio->_mapcnt = 0 & folio->refcnt += 1
refcnt == 1->2(+fbatch_filemap)->3->2(pte, fbatch_truncate),
PG_lru == true, PG_lock == true->false

6. Thread_madv clear folio's PG_lru by
madvise_xxx_pte_range->folio_isolate_lru->folio_test_clear_lru
refcnt == 2(pte,fbatch_truncate), PG_lru == false, PG_lock == false

7. Thread_truncate call folio_fbatch_release and failed in freeing
folio as refcnt not reach 0
refcnt == 1(pte), PG_lru == false, PG_lock == false
********folio becomes an orphan here which is not on the page cache
but on the task's VM**********

8. Thread_bad scheduled back from '0' to be collected in fbatch_bad
refcnt == 2(pte, fbatch_bad), PG_lru == false, PG_lock == true

9. Thread_bad clear one refcnt wrongly when doing filemap_remove_folio
as it take this refcnt as the page cache one
refcnt == 1(fbatch_bad), PG_lru == false, PG_lock == true->false
truncate_inode_folio
filemap_remove_folio
filemap_free_folio
******refcnt decreased wrongly here by being taken as the page cache one ******

10. Thread_bad calls release_pages(fbatch_bad) and has the folio
introduce the bug.
release_pages
folio_put_testzero == true
folio_test_lru == false
list_add(folio->lru, pages_to_free)
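
For reference, the release_pages() path in step 10 looks roughly like this (an abridged fragment from my reading of mm/swap.c; variable setup, zone-device, large-folio and lock-batching handling elided):

	for (i = 0; i < nr; i++) {
		struct folio *folio = ...;	/* from the caller's array */
		...
		if (!folio_put_testzero(folio))
			continue;
		...
		if (folio_test_lru(folio)) {
			lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
			lruvec_del_folio(lruvec, folio);
			__folio_clear_lru_flags(folio);
		}

		list_add(&folio->lru, &pages_to_free);
	}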

2024-03-22 03:20:46

by Matthew Wilcox

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Fri, Mar 22, 2024 at 09:52:36AM +0800, Zhaoyang Huang wrote:
> Thanks for the comments. fix the typo and update the timing sequence
> by amending possible preempt points to have the refcnt make sense.
>
> 0. Thread_bad gets the folio by find_get_entry and preempted before
> take refcnt(could be the second round scan of
> truncate_inode_pages_range)
> refcnt == 1(page_cache), PG_lru == true, PG_lock == false
> find_get_entry
> folio = xas_find
> <preempted>
> folio_try_get_rcu
>
> 1. Thread_filemap get the folio via
> filemap_map_pages->next_uptodate_folio->xas_next_entry and gets preempted
> refcnt == 1(page_cache), PG_lru == true, PG_lock == false
> filemap_map_pages
> next_uptodate_folio
> xas_next_entry
> <preempted>
> folio_try_get_rcu
>
> 2. Thread_truncate get the folio via
> truncate_inode_pages_range->find_lock_entries
> refcnt == 2(page_cache, fbatch_truncate), PG_lru == true, PG_lock == true
>
> 3. Thread_truncate proceed to truncate_cleanup_folio
> refcnt == 2(page_cache, fbatch_truncate), PG_lru == true, PG_lock == true
>
> 4. Thread_truncate proceed to delete_from_page_cache_batch
> refcnt == 1(fbatch_truncate), PG_lru == true, PG_lock == true
>
> 4.1 folio_unlock
> refcnt == 1(fbatch_truncate), PG_lru == true, PG_lock == false

OK, so by the time we get to folio_unlock(), the folio has been removed
from the i_pages xarray.

> 5. Thread_filemap schedule back from '1' and proceed to setup a pte
> and have folio->_mapcnt = 0 & folio->refcnt += 1
> refcnt == 1->2(+fbatch_filemap)->3->2(pte, fbatch_truncate),
> PG_lru == true, PG_lock == true->false

This line succeeds (in next_uptodate_folio):

	if (!folio_try_get_rcu(folio))
		continue;

but then this fails:

	if (unlikely(folio != xas_reload(xas)))
		goto skip;
skip:
	folio_put(folio);

because xas_reload() will return NULL due to the folio being deleted
in step 4. So we never get to the point where we set up a PTE.

There should be no way to create a new PTE for a folio which has been
removed from the page cache. Bugs happen, of course, but I don't see
one yet.
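
For context, the surrounding loop in next_uptodate_folio() is roughly (an abridged sketch from my reading of mm/filemap.c):

	do {
		if (!folio_try_get_rcu(folio))
			continue;
		/* folio removed from (or moved in) the xarray? drop the speculative ref */
		if (unlikely(folio != xas_reload(xas)))
			goto skip;
		if (!folio_test_uptodate(folio) || folio_test_readahead(folio))
			goto skip;
		if (!folio_trylock(folio))
			goto skip;
		if (folio->mapping != mapping)
			goto unlock;
		...
		return folio;
unlock:
		folio_unlock(folio);
skip:
		folio_put(folio);
	} while ((folio = xas_next_entry(xas, end_pgoff)) != NULL);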

> 6. Thread_madv clear folio's PG_lru by
> madvise_xxx_pte_range->folio_isolate_lru->folio_test_clear_lru
> refcnt == 2(pte,fbatch_truncate), PG_lru == false, PG_lock == false
>
> 7. Thread_truncate call folio_fbatch_release and failed in freeing
> folio as refcnt not reach 0
> refcnt == 1(pte), PG_lru == false, PG_lock == false
> ********folio becomes an orphan here which is not on the page cache
> but on the task's VM**********
>
> 8. Thread_bad scheduled back from '0' to be collected in fbatch_bad
> refcnt == 2(pte, fbatch_bad), PG_lru == false, PG_lock == true
>
> 9. Thread_bad clear one refcnt wrongly when doing filemap_remove_folio
> as it take this refcnt as the page cache one
> refcnt == 1(fbatch_bad), PG_lru == false, PG_lock == true->false
> truncate_inode_folio
> filemap_remove_folio
> filemap_free_folio
> ******refcnt decreased wrongly here by being taken as the page cache one ******
>
> 10. Thread_bad calls release_pages(fbatch_bad) and has the folio
> introduce the bug.
> release_pages
> folio_put_testzero == true
> folio_test_lru == false
> list_add(folio->lru, pages_to_free)

2024-03-24 11:14:49

by Zhaoyang Huang

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Fri, Mar 22, 2024 at 11:20 AM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Mar 22, 2024 at 09:52:36AM +0800, Zhaoyang Huang wrote:
> > Thanks for the comments. fix the typo and update the timing sequence
> > by amending possible preempt points to have the refcnt make sense.
> >
> > 0. Thread_bad gets the folio by find_get_entry and preempted before
> > take refcnt(could be the second round scan of
> > truncate_inode_pages_range)
> > refcnt == 1(page_cache), PG_lru == true, PG_lock == false
> > find_get_entry
> > folio = xas_find
> > <preempted>
> > folio_try_get_rcu
> >
> > 1. Thread_filemap get the folio via
> > filemap_map_pages->next_uptodate_folio->xas_next_entry and gets preempted
> > refcnt == 1(page_cache), PG_lru == true, PG_lock == false
> > filemap_map_pages
> > next_uptodate_folio
> > xas_next_entry
> > <preempted>
> > folio_try_get_rcu
> >
> > 2. Thread_truncate get the folio via
> > truncate_inode_pages_range->find_lock_entries
> > refcnt == 2(page_cache, fbatch_truncate), PG_lru == true, PG_lock == true
> >
> > 3. Thread_truncate proceed to truncate_cleanup_folio
> > refcnt == 2(page_cache, fbatch_truncate), PG_lru == true, PG_lock == true
> >
> > 4. Thread_truncate proceed to delete_from_page_cache_batch
> > refcnt == 1(fbatch_truncate), PG_lru == true, PG_lock == true
> >
> > 4.1 folio_unlock
> > refcnt == 1(fbatch_truncate), PG_lru == true, PG_lock == false
>
> OK, so by the time we get to folio_unlock(), the folio has been removed
> from the i_pages xarray.
>
> > 5. Thread_filemap schedule back from '1' and proceed to setup a pte
> > and have folio->_mapcnt = 0 & folio->refcnt += 1
> > refcnt == 1->2(+fbatch_filemap)->3->2(pte, fbatch_truncate),
> > PG_lru == true, PG_lock == true->false
>
> This line succeeds (in next_uptodate_folio):
> if (!folio_try_get_rcu(folio))
> continue;
> but then this fails:
>
> if (unlikely(folio != xas_reload(xas)))
> goto skip;
> skip:
> folio_put(folio);
>
> because xas_reload() will return NULL due to the folio being deleted
> in step 4. So we never get to the point where we set up a PTE.
>
> There should be no way to create a new PTE for a folio which has been
> removed from the page cache. Bugs happen, of course, but I don't see
> one yet.
>
> > 6. Thread_madv clear folio's PG_lru by
> > madvise_xxx_pte_range->folio_isolate_lru->folio_test_clear_lru
> > refcnt == 2(pte,fbatch_truncate), PG_lru == false, PG_lock == false
> >
> > 7. Thread_truncate call folio_fbatch_release and failed in freeing
> > folio as refcnt not reach 0
> > refcnt == 1(pte), PG_lru == false, PG_lock == false
> > ********folio becomes an orphan here which is not on the page cache
> > but on the task's VM**********
> >
> > 8. Thread_bad scheduled back from '0' to be collected in fbatch_bad
> > refcnt == 2(pte, fbatch_bad), PG_lru == false, PG_lock == true
> >
> > 9. Thread_bad clear one refcnt wrongly when doing filemap_remove_folio
> > as it take this refcnt as the page cache one
> > refcnt == 1(fbatch_bad), PG_lru == false, PG_lock == true->false
> > truncate_inode_folio
> > filemap_remove_folio
> > filemap_free_folio
> > ******refcnt decreased wrongly here by being taken as the page cache one ******
> >
> > 10. Thread_bad calls release_pages(fbatch_bad) and has the folio
> > introduce the bug.
> > release_pages
> > folio_put_testzero == true
> > folio_test_lru == false
> > list_add(folio->lru, pages_to_free)

ok. It seems like madvise is robust enough to leave no BUGs. I presume
there are another two scenarios which call folio_isolate_lru in ways other
than via a PTE. Besides, scenario 2 reminds me of a previous bug reported by
me where find_get_entry entered a livelock: the folio's refcnt == 0 but it
remains in the xarray, which causes the reset->retry loop to spin forever. I
would like to reply in that thread with more details.

Scenario 1:
0. Thread_bad gets the folio by find_get_entry and preempted before
folio_lock (could be the second round scan of
truncate_inode_pages_range)
refcnt == 2(page_cache, fbatch_bad), PG_lru == true, PG_lock == false
folio = find_get_entry
folio_try_get_rcu
<preempted>
folio_try_lock

1. Thread_truncate get the folio via
truncate_inode_pages_range->find_lock_entries
refcnt == 3(page_cache, fbatch_bad, fbatch_truncate), PG_lru ==
true, PG_lock == true

2. Thread_truncate proceed to truncate_cleanup_folio
refcnt == 3(page_cache, fbatch_bad, fbatch_truncate), PG_lru ==
true, PG_lock == true

3. Thread_truncate proceed to delete_from_page_cache_batch
refcnt == 2(fbatch_bad, fbatch_truncate), PG_lru == true, PG_lock == true

4 folio_unlock
refcnt == 2(fbatch_bad, fbatch_truncate), PG_lru == true, PG_lock == false

5. Thread_bad schedules back from step 0 and wrongly clears one refcnt
when doing truncate_inode_folio->filemap_remove_folio, as it takes this
refcnt as the page cache one
refcnt == 1'(fbatch_truncate), PG_lru == false, PG_lock == true
folio = find_get_entry
folio_try_get_rcu
folio_try_lock
truncate_inode_folio
filemap_remove_folio
<preempted>

6. Thread_isolate get one refcnt and call folio_isolate_lru(could be
any process)
refcnt == 2'(fbatch_truncate, thread_isolate), PG_lru == true,
PG_lock == true

7. Thread_isolate proceeds to clear PG_lru and gets preempted before folio_get
refcnt == 2'(fbatch_truncate, thread_isolate), PG_lru == false,
PG_lock == true
folio_test_clear_lru
<preempted>
folio_get

8. Thread_bad schedules back from step 5 and proceeds to drop one refcnt
refcnt == 1'(thread_isolate), PG_lru == false, PG_lock == true
folio = find_get_entry
folio_try_get_rcu
folio_try_lock
truncate_inode_folio
filemap_remove_folio
folio_unlock
<preempted>

9. Thread_truncate schedules back from step 3 and proceeds to drop one
refcnt by release_pages and hits the BUG
refcnt == 0'(thread_isolate), PG_lru == false, PG_lock == false

Scenario 2:
0. Thread_bad gets the folio by find_get_entry and preempted before
folio_lock (could be the second round scan of
truncate_inode_pages_range)
refcnt == 2(page_cache, fbatch_bad), PG_lru == true, PG_lock == false
folio = find_get_entry
folio_try_get_rcu
<preempted>
folio_try_lock

1. Thread_readahead remove the folio from page cache and drop one
refcnt by filemap_remove_folio(get rid of the folios which failed to
launch IO during readahead)
refcnt == 1(fbatch_bad), PG_lru == true, PG_lock == true

2. folio_unlock
refcnt == 1(fbatch_bad), PG_lru == true, PG_lock == false

3. Thread_isolate get one refcnt and call folio_isolate_lru(could be
any process)
refcnt == 2(fbatch_bad, thread_isolate), PG_lru == true, PG_lock == false

4. Thread_bad schedule back from step 0 and clear one refcnt wrongly
when doing truncate_inode_folio->filemap_remove_folio as it take this
refcnt as the page cache one
refcnt == 1'(thread_isolate), PG_lru == false, PG_lock == false
find_get_entries
folio = find_get_entry
folio_try_get_rcu
folio_lock
<no check as folio->mapping != mapping as folio_lock_entries does>
truncate_inode_folio
filemap_remove_folio
<preempted>

5. Thread_isolate proceeds to clear PG_lru and gets preempted before folio_get
refcnt == 1'(fbatch_truncate, thread_isolate), PG_lru == false,
PG_lock == false
folio_test_clear_lru
<preempted>
folio_get

6. Thread_bad schedules back from step 4 and proceeds to drop one refcnt
by release_pages and hits the BUG
refcnt == 0'(thread_isolate), PG_lru == false, PG_lock == false

2024-03-25 12:25:23

by Matthew Wilcox

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Sun, Mar 24, 2024 at 07:14:27PM +0800, Zhaoyang Huang wrote:
> ok. It seems like madvise is robust enough to leave no BUGs. I presume
> another two scenarios which call folio_isloate_lru by any other ways
> but PTE. Besides, scenario 2 reminds me of a previous bug reported by
> me as find_get_entry entered in a livelock where the folio's refcnt ==
> 0 but remains at xarray which causes the reset->retry loops forever. I
> would like to reply in that thread for more details.
>
> Scenario 1:
> 0. Thread_bad gets the folio by find_get_entry and preempted before
> folio_lock (could be the second round scan of
> truncate_inode_pages_range)
> refcnt == 2(page_cache, fbatch_bad), PG_lru == true, PG_lock == false
> folio = find_get_entry
> folio_try_get_rcu
> <preempted>
> folio_try_lock
>
> 1. Thread_truncate get the folio via
> truncate_inode_pages_range->find_lock_entries
> refcnt == 3(page_cache, fbatch_bad, fbatch_truncate), PG_lru ==
> true, PG_lock == true

Hang on, you can't have two threads in truncate_inode_pages_range()
at the same time. I appreciate that we don't have any documentation
of that, but if it were possible, we'd see other crashes. Removing
the folio from the page cache sets folio->mapping to NULL. And
__filemap_remove_folio() uses folio->mapping in
filemap_unaccount_folio() and page_cache_delete(), so we'd get NULL
pointer dereferences.

I see a hint in the DAX code that it's an fs-dependent lock:

/*
* This gets called from truncate / punch_hole path. As such, the caller
* must hold locks protecting against concurrent modifications of the
* page cache (usually fs-private i_mmap_sem for writing). Since the
* caller has seen a DAX entry for this index, we better find it
* at that index as well...
*/

so maybe that's why there's no lockdep_assert() in
truncate_inode_pages_range(), but there should be a comment.

> Scenario 2:
> 0. Thread_bad gets the folio by find_get_entry and preempted before
> folio_lock (could be the second round scan of
> truncate_inode_pages_range)
> refcnt == 2(page_cache, fbatch_bad), PG_lru == true, PG_lock == false
> folio = find_get_entry
> folio_try_get_rcu
> <preempted>
> folio_try_lock
>
> 1. Thread_readahead remove the folio from page cache and drop one
> refcnt by filemap_remove_folio(get rid of the folios which failed to
> launch IO during readahead)
> refcnt == 1(fbatch_bad), PG_lru == true, PG_lock == true

So readahead inserts the folio locked, and then calls
filemap_remove_folio() without having unlocked it.
filemap_remove_folio() sets folio->mapping to NULL in
page_cache_delete(). When "Thread_bad" wakes up, it gets the
folio lock, calls truncate_inode_folio() and sees that
folio->mapping != mapping, so it doesn't call filemap_remove_folio().
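
That check sits right at the top of truncate_inode_folio(); abridged, as I read mm/truncate.c:

	int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
	{
		if (folio->mapping != mapping)
			return -EIO;

		truncate_cleanup_folio(folio);
		filemap_remove_folio(folio);
		return 0;
	}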

> 4. Thread_bad schedule back from step 0 and clear one refcnt wrongly
> when doing truncate_inode_folio->filemap_remove_folio as it take this
> refcnt as the page cache one
> refcnt == 1'(thread_isolate), PG_lru == false, PG_lock == false
> find_get_entries
> folio = find_get_entry
> folio_try_get_rcu
> folio_lock
> <no check as folio->mapping != mapping as folio_lock_entries does>
> truncate_inode_folio
> filemap_remove_folio
> <preempted>

2024-03-26 09:07:19

by Zhaoyang Huang

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Mon, Mar 25, 2024 at 11:22 AM Matthew Wilcox <[email protected]> wrote:
>
> On Sun, Mar 24, 2024 at 07:14:27PM +0800, Zhaoyang Huang wrote:
> > ok. It seems like madvise is robust enough to leave no BUGs. I presume
> > another two scenarios which call folio_isloate_lru by any other ways
> > but PTE. Besides, scenario 2 reminds me of a previous bug reported by
> > me as find_get_entry entered in a livelock where the folio's refcnt ==
> > 0 but remains at xarray which causes the reset->retry loops forever. I
> > would like to reply in that thread for more details.
> >
> > Scenario 1:
> > 0. Thread_bad gets the folio by find_get_entry and preempted before
> > folio_lock (could be the second round scan of
> > truncate_inode_pages_range)
> > refcnt == 2(page_cache, fbatch_bad), PG_lru == true, PG_lock == false
> > folio = find_get_entry
> > folio_try_get_rcu
> > <preempted>
> > folio_try_lock
> >
> > 1. Thread_truncate get the folio via
> > truncate_inode_pages_range->find_lock_entries
> > refcnt == 3(page_cache, fbatch_bad, fbatch_truncate), PG_lru ==
> > true, PG_lock == true
>
> Hang on, you can't have two threads in truncate_inode_pages_range()
> at the same time. I appreciate that we don't have any documentation
> of that, but if it were possible, we'd see other crashes. Removing
> the folio from the page cache sets folio->mapping to NULL. And
> __filemap_remove_folio() uses folio->mapping in
> filemap_unaccount_folio() and page_cache_delete(), so we'd get NULL
> pointer dereferences.
ok. I will check if it is possible to have another way of entering
this scenario.
>
> I see a hint in the DAX code that it's an fs-dependent lock:
>
> /*
> * This gets called from truncate / punch_hole path. As such, the caller
> * must hold locks protecting against concurrent modifications of the
> * page cache (usually fs-private i_mmap_sem for writing). Since the
> * caller has seen a DAX entry for this index, we better find it
> * at that index as well...
> */
>
> so maybe that's why there's no lockdep_assert() in
> truncate_inode_pages_range(), but there should be a comment.
>
> > Scenario 2:
> > 0. Thread_bad gets the folio by find_get_entry and preempted before
> > folio_lock (could be the second round scan of
> > truncate_inode_pages_range)
> > refcnt == 2(page_cache, fbatch_bad), PG_lru == true, PG_lock == false
> > folio = find_get_entry
> > folio_try_get_rcu
> > <preempted>
> > folio_try_lock
> >
> > 1. Thread_readahead remove the folio from page cache and drop one
> > refcnt by filemap_remove_folio(get rid of the folios which failed to
> > launch IO during readahead)
> > refcnt == 1(fbatch_bad), PG_lru == true, PG_lock == true
>
> So readaahead inserts the folio locked, and then calls
> filemap_remove_folio() without having unlocked it.
> filemap_remove_folio() sets folio->mapping to NULL in
> page_cache_delete(). When "Thread_bad" wakes up, it gets the
> folio lock, calls truncate_inode_folio() and sees that
> folio->mapping != mapping, so it doesn't call filemap_remove_folio().
>
> > 4. Thread_bad schedule back from step 0 and clear one refcnt wrongly
> > when doing truncate_inode_folio->filemap_remove_folio as it take this
> > refcnt as the page cache one
> > refcnt == 1'(thread_isolate), PG_lru == false, PG_lock == false
> > find_get_entries
> > folio = find_get_entry
> > folio_try_get_rcu
> > folio_lock
> > <no check as folio->mapping != mapping as folio_lock_entries does>
> > truncate_inode_folio
> > filemap_remove_folio
> > <preempted>
Please review the following scenario, where the folio wrongly has two
refcnts dropped when read_pages cleans up the non-IO folios left in the
ractl. Should we switch that cleanup to __readahead_folio?

0. Thread_bad gets the folio by find_get_entry and preempted after
folio_try_get_rcu (could be the second round scan of
truncate_inode_pages_range)
refcnt == 2(page_cache, fbatch_bad), PG_lru == true, PG_lock == false
folio = find_get_entry
folio_try_get_rcu
<preempted>

1. Thread_readahead remove the folio from page cache and drop 2 refcnt
by readahead_folio & filemap_remove_folio(get rid of the folios which
failed to launch IO during readahead)
refcnt == 0, PG_lru == true, PG_lock == true
read_pages
..
folio = readahead_folio
<one refcnt dropped here>
********For the folio which can not launch IO, we should NOT drop
refcnt here??? replaced by __readahead_folio???**********
folio_get
filemap_remove_folio(folio)
folio_unlock
<one refcnt dropped here>
folio_put

2. folio_unlock
refcnt == 0, PG_lru == true, PG_lock == false

3. Thread_isolate get one refcnt and call folio_isolate_lru(could be
any process)
refcnt == 1(thread_isolate), PG_lru == true, PG_lock == false

4. Thread_isolate proceeds to clear PG_lru and gets preempted before folio_get
refcnt == 1(thread_isolate), PG_lru == false, PG_lock == false
folio_test_clear_lru
<preempted>
folio_get

5. Thread_bad schedules back from step 0 and proceeds to drop one refcnt
by release_pages and hits the BUG
refcnt == 0'(thread_isolate), PG_lru == false, PG_lock == false

2024-03-26 12:22:16

by Matthew Wilcox

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Tue, Mar 26, 2024 at 05:06:55PM +0800, Zhaoyang Huang wrote:
> 1. Thread_readahead remove the folio from page cache and drop 2 refcnt
> by readahead_folio & filemap_remove_folio(get rid of the folios which
> failed to launch IO during readahead)
> refcnt == 0, PG_lru == true, PG_lock == true
> read_pages
> ...
> folio = readahead_folio
> <one refcnt dropped here>
> ********For the folio which can not launch IO, we should NOT drop
> refcnt here??? replaced by __readahead_folio???**********
> folio_get
> filemap_remove_folio(folio)
> folio_unlock
> <one refcnt dropped here>
> folio_put

Ignoring any other thread, you're basically saying that there's a
refcount imbalance here. Which means we'd hit an assert (that folio
refcount went below zero) in the normal case where another thread wasn't
simultaneously trying to do anything.

2024-03-27 01:28:38

by Zhaoyang Huang

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Tue, Mar 26, 2024 at 8:21 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Mar 26, 2024 at 05:06:55PM +0800, Zhaoyang Huang wrote:
> > 1. Thread_readahead remove the folio from page cache and drop 2 refcnt
> > by readahead_folio & filemap_remove_folio(get rid of the folios which
> > failed to launch IO during readahead)
> > refcnt == 0, PG_lru == true, PG_lock == true
> > read_pages
> > ...
> > folio = readahead_folio
> > <one refcnt dropped here>
> > ********For the folio which can not launch IO, we should NOT drop
> > refcnt here??? replaced by __readahead_folio???**********
> > folio_get
> > filemap_remove_folio(folio)
> > folio_unlock
> > <one refcnt dropped here>
> > folio_put
>
> Ignoring any other thread, you're basically saying that there's a
> refcount imbalance here. Which means we'd hit an assert (that folio
> refcount went below zero) in the normal case where another thread wasn't
> simultaneously trying to do anything.
Theoretically Yes but it is rare in practice as aops->readahead will
launch all pages to IO under most scenarios.

read_pages
aops->readahead[1]
..
while (folio = readahead_folio)[2]
filemap_remove_folio

IMO, according to the comments of readahead_page, dropping the refcnt that
represents the page cache in [1] makes sense for two reasons: '1.
The folio is going to do IO and is locked until the IO is done; 2. The refcnt
will be added back when the folio is found again in the page cache and then
serves the PTE or vfs', while it doesn't make sense in [2] as the refcnt of
the page cache will be dropped in filemap_remove_folio

* Context: The page is locked and has an elevated refcount. The caller
* should decreases the refcount once the page has been submitted for I/O
* and unlock the page once all I/O to that page has completed.
* Return: A pointer to the next page, or %NULL if we are done.

2024-03-27 14:09:44

by Matthew Wilcox

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Wed, Mar 27, 2024 at 09:25:59AM +0800, Zhaoyang Huang wrote:
> > Ignoring any other thread, you're basically saying that there's a
> > refcount imbalance here. Which means we'd hit an assert (that folio
> > refcount went below zero) in the normal case where another thread wasn't
> > simultaneously trying to do anything.
> Theoretically Yes but it is rare in practice as aops->readahead will
> launch all pages to IO under most scenarios.

Rare, but this path has been tested.

> read_pages
> aops->readahead[1]
> ...
> while (folio = readahead_folio)[2]
> filemap_remove_folio
>
> IMO, according to the comments of readahead_page, the refcnt
> represents page cache dropped in [1] makes sense for two reasons, '1.
> The folio is going to do IO and is locked until IO done;2. The refcnt
> will be added back when found again from the page cache and then serve
> for PTE or vfs' while it doesn't make sense in [2] as the refcnt of
> page cache will be dropped in filemap_remove_folio
>
> * Context: The page is locked and has an elevated refcount. The caller
> * should decreases the refcount once the page has been submitted for I/O
> * and unlock the page once all I/O to that page has completed.
> * Return: A pointer to the next page, or %NULL if we are done.

Follow the refcount through.

In page_cache_ra_unbounded():

folio = filemap_alloc_folio(gfp_mask, 0);
(folio has refcount 1)
ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
(folio has refcount 2)

Then we call read_pages()
First we call ->readahead() which for some reason stops early.
Then we call readahead_folio() which calls folio_put()
(folio has refcount 1)
Then we call folio_get()
(folio has refcount 2)
Then we call filemap_remove_folio()
(folio has refcount 1)
Then we call folio_unlock()
Then we call folio_put()
(folio has refcount 0 and is freed)

Yes, other things can happen in there to increment the refcount, so this
folio_put() might not be the last put, but we hold the folio locked the
entire time, so many things which might be attempted will block on the
folio lock. In particular, nobody can remove it from the page cache,
so its refcount cannot reach 0 until the last folio_put() of the
sequence.
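
In code, the cleanup loop being walked through here is roughly the following (an abridged sketch from my reading of read_pages() in mm/readahead.c; the readahead-window bookkeeping that gates the removal is elided):

	if (aops->readahead) {
		aops->readahead(rac);
		/* clean up folios ->readahead() left behind without submitting I/O */
		while ((folio = readahead_folio(rac)) != NULL) {	/* drops one ref */
			folio_get(folio);
			filemap_remove_folio(folio);	/* drops the page-cache ref */
			folio_unlock(folio);
			folio_put(folio);		/* may be the final put */
		}
	}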

2024-03-28 01:27:53

by Zhaoyang Huang

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Wed, Mar 27, 2024 at 8:31 PM Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Mar 27, 2024 at 09:25:59AM +0800, Zhaoyang Huang wrote:
> > > Ignoring any other thread, you're basically saying that there's a
> > > refcount imbalance here. Which means we'd hit an assert (that folio
> > > refcount went below zero) in the normal case where another thread wasn't
> > > simultaneously trying to do anything.
> > Theoretically Yes but it is rare in practice as aops->readahead will
> > launch all pages to IO under most scenarios.
>
> Rare, but this path has been tested.
>
> > read_pages
> > aops->readahead[1]
> > ...
> > while (folio = readahead_folio)[2]
> > filemap_remove_folio
> >
> > IMO, according to the comments of readahead_page, the refcnt
> > represents page cache dropped in [1] makes sense for two reasons, '1.
> > The folio is going to do IO and is locked until IO done;2. The refcnt
> > will be added back when found again from the page cache and then serve
> > for PTE or vfs' while it doesn't make sense in [2] as the refcnt of
> > page cache will be dropped in filemap_remove_folio
> >
> > * Context: The page is locked and has an elevated refcount. The caller
> > * should decreases the refcount once the page has been submitted for I/O
> > * and unlock the page once all I/O to that page has completed.
> > * Return: A pointer to the next page, or %NULL if we are done.
>
> Follow the refcount through.
>
> In page_cache_ra_unbounded():
>
> folio = filemap_alloc_folio(gfp_mask, 0);
> (folio has refcount 1)
> ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
> (folio has refcount 2)
>
> Then we call read_pages()
> First we call ->readahead() which for some reason stops early.
> Then we call readahead_folio() which calls folio_put()
> (folio has refcount 1)
> Then we call folio_get()
> (folio has refcount 2)
> Then we call filemap_remove_folio()
> (folio has refcount 1)
> Then we call folio_unlock()
> Then we call folio_put()
ok, I missed the refcnt from alloc_pages. However, I still think it is
a bug to call readahead_folio in read_pages as the refcnt obtained by
alloc_pages should be its final guard which is paired to the one which
checked in shrink_folio_list->__remove_mapping->folio_ref_freeze(2)(this
2 represent alloc_pages & page cache). If we removed this one without
isolating the folio from LRU, the following race could happen.
Furthermore, the refcnt dropped in the readahead_folio represents page
cache, it doesn't make sense to drop it twice in read_pages.

0. Thread_readahead:
folio_put()
folio_put_test_zero() == true
__folio_put()
folio_test_lru() == true
<preempted>

1. Thread_isolate
folio_isolate_lru
folio_test_clear_lru()
lruvec_del_folio()

2. Thread_readahead
folio_put()
folio_put_test_zero() == true
__folio_put
folio_test_lru() == true
<schedule back from 0>
lruvec_del_folio()



> (folio has refcount 0 and is freed)
>
> Yes, other things can happen in there to increment the refcount, so this
> folio_put() might not be the last put, but we hold the folio locked the
> entire time, so many things which might be attempted will block on the
> folio lock. In particular, nobody can remove it from the page cache,
> so its refcount cannot reach 0 until the last folio_put() of the
> sequence.

2024-03-28 03:18:45

by Matthew Wilcox

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Thu, Mar 28, 2024 at 09:27:31AM +0800, Zhaoyang Huang wrote:
> ok, I missed the refcnt from alloc_pages. However, I still think it is
> a bug to call readahead_folio in read_pages as the refcnt obtained by
> alloc_pages should be its final guard which is paired to the one which
> checked in shrink_folio_list->__remove_mapping->folio_ref_freeze(2)(this
> 2 represent alloc_pages & page cache). If we removed this one without

__remove_mapping() requires that the caller holds the folio locked.
Since the readahead code unlocks the folio, __remove_mapping() cannot
be run because the caller of __remove_mapping() will wait for the folio
lock.
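
Both the locking requirement and the freeze being discussed are visible at the top of __remove_mapping(); abridged, as I read mm/vmscan.c:

	static int __remove_mapping(struct address_space *mapping, struct folio *folio,
				    bool reclaimed, struct mem_cgroup *target_memcg)
	{
		int refcount;

		BUG_ON(!folio_test_locked(folio));
		BUG_ON(mapping != folio_mapping(folio));
		...
		refcount = 1 + folio_nr_pages(folio);
		if (!folio_ref_freeze(folio, refcount))
			goto cannot_free;
		...
	}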

2024-03-28 04:03:44

by Zhaoyang Huang

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Thu, Mar 28, 2024 at 11:18 AM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Mar 28, 2024 at 09:27:31AM +0800, Zhaoyang Huang wrote:
> > ok, I missed the refcnt from alloc_pages. However, I still think it is
> > a bug to call readahead_folio in read_pages as the refcnt obtained by
> > alloc_pages should be its final guard which is paired to the one which
> > checked in shrink_folio_list->__remove_mapping->folio_ref_freeze(2)(this
> > 2 represent alloc_pages & page cache). If we removed this one without
>
> __remove_mapping() requires that the caller holds the folio locked.
> Since the readahead code unlocks the folio, __remove_mapping() cannot
> be run because the caller of __remove_mapping() will wait for the folio
> lock.
repost the whole timing sequence to make it more clear and fix
incorrect description of previous feedback

Follow the refcount through.

In page_cache_ra_unbounded():

folio = filemap_alloc_folio(gfp_mask, 0);
(folio has refcount 1)
ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
(folio has refcount 2, PG_lru)

Then we call read_pages()
First we call ->readahead() which for some reason stops early.
Then we call readahead_folio() which calls folio_put()
(folio has refcount 1)
Then we call folio_get()
(folio has refcount 2)
Then we call filemap_remove_folio()
(folio has refcount 1)
Then we call folio_unlock()
Then we call folio_put()

Amending steps for previous timing sequence below where [1] races with
[2] that has nothing to do with __remove_mapping(). IMO, no file_folio
should be freed by folio_put as the refcnt obtained by alloc_pages
keep it always imbalanced until shrink_folio_list->__remove_mapping,
where the folio_ref_freeze(2) implies the refcnt of alloc_pages and
isolation should be the last two. release_pages is a special scenario
that the refcnt of alloc_pages is freed implicitly in
delete_from_page_cache_batch->filemap_free_folio.

folio_put()
{
if(folio_put_test_zero())
*** we should NOT be here as the refcnt of alloc_pages should NOT be dropped ***
if (folio_test_lru())
*** preempted here with refcnt == 0 and pass PG_lru check ***
[1]
lruvec_del_folio()
Then thread_isolate call folio_isolate_lru()
folio_isolate_lru()
{
folio_test_clear_lru()
folio_get()
[2]
lruvec_del_folio()
}
--------------------------------------------------------------------------------------------
shrink_folio_list()
{
__remove_mapping()
{
refcount = 1 + folio_nr_pages;
*** the refcount = 1 + 1 implies there should be only the refcnt of
alloc_pages and previous isolation for a no-busy folio as all PTE has
gone***
if (!folio_ref_freeze(refcount))
goto keeplock;
}
}

2024-03-28 14:13:11

by Matthew Wilcox

Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Thu, Mar 28, 2024 at 12:03:02PM +0800, Zhaoyang Huang wrote:
> On Thu, Mar 28, 2024 at 11:18 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Thu, Mar 28, 2024 at 09:27:31AM +0800, Zhaoyang Huang wrote:
> > > ok, I missed the refcnt from alloc_pages. However, I still think it is
> > > a bug to call readahead_folio in read_pages as the refcnt obtained by
> > > alloc_pages should be its final guard which is paired to the one which
> > > checked in shrink_folio_list->__remove_mapping->folio_ref_freeze(2)(this
> > > 2 represent alloc_pages & page cache). If we removed this one without
> >
> > __remove_mapping() requires that the caller holds the folio locked.
> > Since the readahead code unlocks the folio, __remove_mapping() cannot
> > be run because the caller of __remove_mapping() will wait for the folio
> > lock.
> repost the whole timing sequence to make it more clear and fix
> incorrect description of previous feedback

I can't understand what you think the problem is here. Please try
again.

> Follow the refcount through.
>
> In page_cache_ra_unbounded():
>
> folio = filemap_alloc_folio(gfp_mask, 0);
> (folio has refcount 1)
> ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
> (folio has refcount 2, PG_lru)
>
> Then we call read_pages()
> First we call ->readahead() which for some reason stops early.
> Then we call readahead_folio() which calls folio_put()
> (folio has refcount 1)
> Then we call folio_get()
> (folio has refcount 2)
> Then we call filemap_remove_folio()
> (folio has refcount 1)
> Then we call folio_unlock()
> Then we call folio_put()
>
> Amending steps for previous timing sequence below where [1] races with
> [2] that has nothing to do with __remove_mapping(). IMO, no file_folio
> should be freed by folio_put as the refcnt obtained by alloc_pages
> keep it always imbalanced until shrink_folio_list->__remove_mapping,
> where the folio_ref_freeze(2) implies the refcnt of alloc_pages and
> isolation should be the last two. release_pages is a special scenario
> that the refcnt of alloc_pages is freed implicitly in
> delete_from_page_cache_batch->filemap_free_folio.
>
> folio_put()
> {
> if(folio_put_test_zero())
> *** we should NOT be here as the refcnt of alloc_pages should NOT be dropped ***
> if (folio_test_lru())
> *** preempted here with refcnt == 0 and pass PG_lru check ***
> [1]
> lruvec_del_folio()
> Then thread_isolate call folio_isolate_lru()
> folio_isolate_lru()
> {
> folio_test_clear_lru()
> folio_get()
> [2]
> lruvec_del_folio()
> }
> --------------------------------------------------------------------------------------------
> shrink_folio_list()
> {
> __remove_mapping()
> {
> refcount = 1 + folio_nr_pages;
> *** the refcount = 1 + 1 implies there should be only the refcnt of
> alloc_pages and previous isolation for a no-busy folio as all PTE has
> gone***
> if (!folio_ref_freeze(refcount))
> goto keeplock;
> }
> }

2024-03-29 05:49:29

by Zhaoyang Huang

[permalink] [raw]
Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Thu, Mar 28, 2024 at 10:12 PM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Mar 28, 2024 at 12:03:02PM +0800, Zhaoyang Huang wrote:
> > On Thu, Mar 28, 2024 at 11:18 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Thu, Mar 28, 2024 at 09:27:31AM +0800, Zhaoyang Huang wrote:
> > > > ok, I missed the refcnt from alloc_pages. However, I still think it is
> > > > a bug to call readahead_folio in read_pages as the refcnt obtained by
> > > > alloc_pages should be its final guard which is paired to the one which
> > > > checked in shrink_folio_list->__remove_mapping->folio_ref_freeze(2)(this
> > > > 2 represent alloc_pages & page cache). If we removed this one without
> > >
> > > __remove_mapping() requires that the caller holds the folio locked.
> > > Since the readahead code unlocks the folio, __remove_mapping() cannot
> > > be run because the caller of __remove_mapping() will wait for the folio
> > > lock.
> > repost the whole timing sequence to make it more clear and fix
> > incorrect description of previous feedback
>
> I can't understand what you think the problem is here. Please try
> again.
>
> > Follow the refcount through.
> >
> > In page_cache_ra_unbounded():
> >
> > folio = filemap_alloc_folio(gfp_mask, 0);
> > (folio has refcount 1)
> > ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
> > (folio has refcount 2, PG_lru)
> >
> > Then we call read_pages()
> > First we call ->readahead() which for some reason stops early.
> > Then we call readahead_folio() which calls folio_put()
> > (folio has refcount 1)
> > Then we call folio_get()
> > (folio has refcount 2)
> > Then we call filemap_remove_folio()
> > (folio has refcount 1)
> > Then we call folio_unlock()
> > Then we call folio_put()
> >
> > Amending steps for previous timing sequence below where [1] races with
> > [2] that has nothing to do with __remove_mapping(). IMO, no file_folio
> > should be freed by folio_put as the refcnt obtained by alloc_pages
> > keep it always imbalanced until shrink_folio_list->__remove_mapping,
> > where the folio_ref_freeze(2) implies the refcnt of alloc_pages and
> > isolation should be the last two. release_pages is a special scenario
> > that the refcnt of alloc_pages is freed implicitly in
> > delete_from_page_cache_batch->filemap_free_folio.
> >
> > folio_put()
> > {
> > if(folio_put_test_zero())
> > *** we should NOT be here as the refcnt of alloc_pages should NOT be dropped ***
> > if (folio_test_lru())
> > *** preempted here with refcnt == 0 and pass PG_lru check ***
> > [1]
> > lruvec_del_folio()
> > Then thread_isolate call folio_isolate_lru()
> > folio_isolate_lru()
> > {
> > folio_test_clear_lru()
> > folio_get()
> > [2]
> > lruvec_del_folio()
> > }
> > --------------------------------------------------------------------------------------------
> > shrink_folio_list()
> > {
> > __remove_mapping()
> > {
> > refcount = 1 + folio_nr_pages;
> > *** the refcount = 1 + 1 implies there should be only the refcnt of
> > alloc_pages and previous isolation for a no-busy folio as all PTE has
> > gone***
> > if (!folio_ref_freeze(refcount))
> > goto keeplock;
> > }
> > }
key steps in brief:
Thread_truncate gets the folio into its local fbatch via find_get_entry() in step 2.
The refcount drops to 1 in step 7, but the remaining reference is not the
expected one from alloc_pages; it is the one held by thread_truncate's
local fbatch.
Thread_reclaim then succeeds in isolating the folio on the strength of that
wrong refcount (wrong in meaning, not in value) in step 8.
Thread_truncate hits the VM_BUG_ON in step 9.

all steps:
Thread_readahead:
0. folio = filemap_alloc_folio(gfp_mask, 0);
(folio has refcount 1)
1. ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
(folio has refcount 2)
2. thread_truncate takes one reference via find_get_entry() and adds this folio to fbatch_truncate
(folio has refcount 3 (alloc, page cache, fbatch_truncate), PG_lru)
3. Then we call read_pages()
First we call ->readahead() which for some reason stops early.
4. Then we call readahead_folio() which calls folio_put()
(folio has refcount 2)
5. Then we call folio_get()
(folio has refcount 3)
6. Then we call filemap_remove_folio()
(folio has refcount 2)
7. Then we call folio_unlock()
Then we call folio_put()
(folio has refcount 1(fbatch_truncate))
8. thread_reclaim calls shrink_inactive_list()->isolate_lru_folios()
shrink_inactive_list()
    isolate_lru_folios()
        if (!folio_test_lru(folio))
        if (!folio_try_get(folio))
        if (!folio_test_clear_lru(folio))
        list_move(folio, dst)
(folio has refcount 2)

8.1. thread_reclaim calls shrink_folio_list()->__remove_mapping()
shrink_folio_list()
    __remove_mapping()
        (refcount = 2)
        if (!folio_ref_freeze(2))    /* the freeze to 0 succeeds, so the test falls through */
        list_add(folio, free_folios);
(folio has refcount 0)

9. thread_truncate will then hit the refcount VM_BUG_ON (refcnt == 0) in
folio_put_testzero()
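
To make the bookkeeping above easier to check, here is a user-space
walk-through of the refcount at each numbered step, assuming (as step 7
argues) that the last remaining reference belongs to fbatch_truncate; it
uses plain integers and no kernel APIs:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    int refcnt;

    refcnt  = 1;         /* step 0: filemap_alloc_folio()                  */
    refcnt += 1;         /* step 1: filemap_add_folio() -> page cache      */
    refcnt += 1;         /* step 2: find_get_entry() -> fbatch_truncate    */
    refcnt -= 1;         /* step 4: readahead_folio() -> folio_put()       */
    refcnt += 1;         /* step 5: folio_get() in read_pages()            */
    refcnt -= 1;         /* step 6: filemap_remove_folio()                 */
    refcnt -= 1;         /* step 7: final folio_put() in read_pages()      */
    assert(refcnt == 1); /* only fbatch_truncate's reference is left       */

    refcnt += 1;         /* step 8: folio_try_get() in isolate_lru_folios() */
    assert(refcnt == 2); /* looks just like "page cache + isolation"       */

    refcnt = 0;          /* step 8.1: folio_ref_freeze(2) succeeds and the
                            folio goes onto free_folios                    */

    printf("refcnt when thread_truncate finally does folio_put(): %d\n",
           refcnt);
    return 0;
}

Both asserts hold, and the reference thread_truncate still believes it owns
is gone by step 8.1, which is where the VM_BUG_ON fires in step 9.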

2024-03-29 12:20:02

by Matthew Wilcox

[permalink] [raw]
Subject: Re: summarize all information again at bottom//reply: reply: [PATCH] mm: fix a race scenario in folio_isolate_lru

On Fri, Mar 29, 2024 at 01:49:05PM +0800, Zhaoyang Huang wrote:
> On Thu, Mar 28, 2024 at 10:12 PM Matthew Wilcox <[email protected]> wrote:
> key steps in brief:
> Thread_truncate get folio to its local fbatch by find_get_entry in step 2
> The refcnt is deducted to 1 which is not as expect as from alloc_pages
> but from thread_truncate's local fbatch in step 7
> Thread_reclaim succeed to isolate the folio by the wrong refcnt(not
> the value but meaning) in step 8
> Thread_truncate hit the VM_BUG_ON in step 9
>
> all steps:
> Thread_readahead:
> 0. folio = filemap_alloc_folio(gfp_mask, 0);
> (folio has refcount 1)
> 1. ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
> (folio has refcount 2)
> 2. thread_truncate hold one refcnt and add this folio to fbatch_truncate
> (folio has refcount 3(alloc, page cache, fbatch_truncate), PG_lru)
> 3. Then we call read_pages()
> First we call ->readahead() which for some reason stops early.
> 4. Then we call readahead_folio() which calls folio_put()
> (folio has refcount 2)
> 5. Then we call folio_get()
> (folio has refcount 3)
> 6. Then we call filemap_remove_folio()
> (folio has refcount 2)
> 7. Then we call folio_unlock()
> Then we call folio_put()
> (folio has refcount 1(fbatch_truncate))
> 8. thread_reclaim call shrink_inactive_list->isolate_lru_folios
> shrink_inactive_list
> isolate_lru_folios
> if (!folio_test_lru(folio))
> if (!folio_try_get(folio))
> if (!folio_test_clear_lru(folio))
> list_move(folio, dst)
> (folio has refcount 2)
>
> 8.1. thread_reclaim call shrink_folio_list->__remove_mapping
> shrink_folio_list()
> __remove_mapping()
> (refcount = 2)
> if (!folio_ref_freeze(2)) //true
> list_add(folio, free_folios);
> (folio has refcount 0)
>
> 9. thread_truncate will hit the refcnt VM_BUG_ON(refcnt == 0) in
> folio_put_testzero

But now you're talking about something _entirely different_ that isn't
the bug you hit. isolate_lru_folios() is not folio_isolate_lru().

I am disinclined to pick through this example to find out why you're
wrong again. I'm also disinclined to continue this correspondence.
We're not making any progress here.