There is a race during page offline that can lead to an infinite loop:
a page never ends up on a buddy list and __offline_pages() keeps
retrying indefinitely, or until a termination signal is received.
Thread#1 - a new process:
load_elf_binary
begin_new_exec
exec_mmap
mmput
exit_mmap
tlb_finish_mmu
tlb_flush_mmu
release_pages
free_unref_page_list
free_unref_page_prepare
set_pcppage_migratetype(page, migratetype);
// Set page->index migration type below MIGRATE_PCPTYPES
Thread#2 - hot-removes memory
__offline_pages
start_isolate_page_range
set_migratetype_isolate
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
// Set migration type to MIGRATE_ISOLATE
drain_all_pages(zone);
// drain per-cpu page lists to buddy allocator.
Thread#1 - continue
free_unref_page_commit
migratetype = get_pcppage_migratetype(page);
// get old migration type
list_add(&page->lru, &pcp->lists[migratetype]);
// add new page to already drained pcp list
Thread#2
Never drains pcp again, and therefore gets stuck in the loop.
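For reference, the pcp migratetype is cached in the page itself; a sketch
of the helpers (as in mm/page_alloc.c around v5.8; details may differ
between versions) shows why Thread#1 still sees the pre-isolation value:

static inline void set_pcppage_migratetype(struct page *page, int migratetype)
{
	/* Stashed at free_unref_page_prepare() time, read at commit time. */
	page->index = migratetype;
}

static inline int get_pcppage_migratetype(struct page *page)
{
	return page->index;
}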
The fix is to try to drain per-cpu lists again after
check_pages_isolated_cb() fails.
Signed-off-by: Pavel Tatashin <[email protected]>
Cc: [email protected]
---
mm/memory_hotplug.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e9d5ab5d3ca0..d6d54922bfce 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
/* check again */
ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
NULL, check_pages_isolated_cb);
+ /*
+ * per-cpu pages are drained in start_isolate_page_range, but if
+ * there are still pages that are not free, make sure that we
+ * drain again, because when we isolated range we might
+ * have raced with another thread that was adding pages to
+ * pcp list.
+ */
+ if (ret)
+ drain_all_pages(zone);
} while (ret);
/* Ok, all of our target is isolated.
--
2.25.1
Hi
[This is an automated email]
This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all
The bot has tested the following trees: v5.8.5, v5.4.61, v4.19.142, v4.14.195, v4.9.234, v4.4.234.
v5.8.5: Build OK!
v5.4.61: Build OK!
v4.19.142: Failed to apply! Possible dependencies:
2932c8b05056 ("mm, memory_hotplug: be more verbose for memory offline failures")
5557c766abad ("mm, memory_hotplug: cleanup memory offline path")
7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure")
7c2ee349cf79 ("memblock: rename __free_pages_bootmem to memblock_free_pages")
9b7ea46a82b3 ("mm/hotplug: fix offline undo_isolate_page_range()")
a9cd410a3d29 ("mm/page_alloc.c: memory hotplug: free pages as higher order")
bb8965bd82fd ("mm, memory_hotplug: deobfuscate migration part of offlining")
d381c54760dc ("mm: only report isolation failures when offlining memory")
v4.14.195: Failed to apply! Possible dependencies:
1b7176aea0a9 ("memory hotplug: fix comments when adding section")
24e6d5a59ac7 ("mm: pass the vmem_altmap to arch_add_memory and __add_pages")
2f47a91f4dab ("mm: deferred_init_memmap improvements")
381eab4a6ee8 ("mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock")
7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure")
7b73d978a5d0 ("mm: pass the vmem_altmap to vmemmap_populate")
80b1f41c0957 ("mm: split deferred_init_range into initializing and freeing parts")
9bb5a391f9a5 ("mm, memory_hotplug: fix memmap initialization")
b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node")
bb8965bd82fd ("mm, memory_hotplug: deobfuscate migration part of offlining")
d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
e8b098fc5747 ("mm: kernel-doc: add missing parameter descriptions")
f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
v4.9.234: Failed to apply! Possible dependencies:
1b862aecfbd4 ("mm, memory_hotplug: get rid of is_zone_device_section")
381eab4a6ee8 ("mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock")
385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters")
438cc81a41e8 ("powerpc/pseries: Automatically resize HPT for memory hot add/remove")
72675e131eb4 ("mm, memory_hotplug: drop zone from build_all_zonelists")
7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure")
88ed365ea227 ("mm: don't steal highatomic pageblock")
a6ffdc07847e ("mm: use is_migrate_highatomic() to simplify the code")
b93e0f329e24 ("mm, memory_hotplug: get rid of zonelists_mutex")
bb8965bd82fd ("mm, memory_hotplug: deobfuscate migration part of offlining")
c8f9565716e3 ("mm, memory_hotplug: use node instead of zone in can_online_high_movable")
f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
v4.4.234: Failed to apply! Possible dependencies:
0caeef63e6d2 ("libnvdimm: Add a poison list and export badblocks")
0e749e54244e ("dax: increase granularity of dax_clear_blocks() operations")
1b862aecfbd4 ("mm, memory_hotplug: get rid of is_zone_device_section")
260ae3f7db61 ("mm: skip memory block registration for ZONE_DEVICE")
34c0fd540e79 ("mm, dax, pmem: introduce pfn_t")
381eab4a6ee8 ("mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock")
4a65429457a5 ("s390/mm: fix zone calculation in arch_add_memory()")
4b94ffdc4163 ("x86, mm: introduce vmem_altmap to augment vmemmap_populate()")
7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure")
87ba05dff351 ("libnvdimm: don't fail init for full badblocks list")
ad9a8bde2cb1 ("libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h")
b95f5f4391fa ("libnvdimm: convert to statically allocated badblocks")
bb8965bd82fd ("mm, memory_hotplug: deobfuscate migration part of offlining")
f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
NOTE: The patch will not be queued to stable trees until it is upstream.
How should we proceed with this patch?
--
Thanks
Sasha
On Tue, 1 Sep 2020, Pavel Tatashin wrote:
> There is a race during page offline that can lead to an infinite loop:
> a page never ends up on a buddy list and __offline_pages() keeps
> retrying indefinitely, or until a termination signal is received.
>
> Thread#1 - a new process:
>
> load_elf_binary
> begin_new_exec
> exec_mmap
> mmput
> exit_mmap
> tlb_finish_mmu
> tlb_flush_mmu
> release_pages
> free_unref_page_list
> free_unref_page_prepare
> set_pcppage_migratetype(page, migratetype);
> // Set page->index migration type below MIGRATE_PCPTYPES
>
> Thread#2 - hot-removes memory
> __offline_pages
> start_isolate_page_range
> set_migratetype_isolate
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> // Set migration type to MIGRATE_ISOLATE
> drain_all_pages(zone);
> // drain per-cpu page lists to buddy allocator.
>
> Thread#1 - continue
> free_unref_page_commit
> migratetype = get_pcppage_migratetype(page);
> // get old migration type
> list_add(&page->lru, &pcp->lists[migratetype]);
> // add new page to already drained pcp list
>
> Thread#2
> Never drains pcp again, and therefore gets stuck in the loop.
>
> The fix is to try to drain per-cpu lists again after
> check_pages_isolated_cb() fails.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> Cc: [email protected]
Acked-by: David Rientjes <[email protected]>
[Cc Mel and Vlastimil - I am still rummaging]
On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
> There is a race during page offline that can lead to an infinite loop:
> a page never ends up on a buddy list and __offline_pages() keeps
> retrying indefinitely, or until a termination signal is received.
>
> Thread#1 - a new process:
>
> load_elf_binary
> begin_new_exec
> exec_mmap
> mmput
> exit_mmap
> tlb_finish_mmu
> tlb_flush_mmu
> release_pages
> free_unref_page_list
> free_unref_page_prepare
> set_pcppage_migratetype(page, migratetype);
> // Set page->index migration type below MIGRATE_PCPTYPES
>
> Thread#2 - hot-removes memory
> __offline_pages
> start_isolate_page_range
> set_migratetype_isolate
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> // Set migration type to MIGRATE_ISOLATE
> drain_all_pages(zone);
> // drain per-cpu page lists to buddy allocator.
>
> Thread#1 - continue
> free_unref_page_commit
> migratetype = get_pcppage_migratetype(page);
> // get old migration type
> list_add(&page->lru, &pcp->lists[migratetype]);
> // add new page to already drained pcp list
>
> Thread#2
> Never drains pcp again, and therefore gets stuck in the loop.
>
> The fix is to try to drain per-cpu lists again after
> check_pages_isolated_cb() fails.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> Cc: [email protected]
> ---
> mm/memory_hotplug.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9d5ab5d3ca0..d6d54922bfce 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
> /* check again */
> ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> NULL, check_pages_isolated_cb);
> + /*
> + * per-cpu pages are drained in start_isolate_page_range, but if
> + * there are still pages that are not free, make sure that we
> + * drain again, because when we isolated range we might
> + * have raced with another thread that was adding pages to
> + * pcp list.
> + */
> + if (ret)
> + drain_all_pages(zone);
> } while (ret);
>
> /* Ok, all of our target is isolated.
> --
> 2.25.1
>
--
Michal Hocko
SUSE Labs
On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
> There is a race during page offline that can lead to an infinite loop:
> a page never ends up on a buddy list and __offline_pages() keeps
> retrying indefinitely, or until a termination signal is received.
>
> Thread#1 - a new process:
>
> load_elf_binary
> begin_new_exec
> exec_mmap
> mmput
> exit_mmap
> tlb_finish_mmu
> tlb_flush_mmu
> release_pages
> free_unref_page_list
> free_unref_page_prepare
> set_pcppage_migratetype(page, migratetype);
> // Set page->index migration type below MIGRATE_PCPTYPES
>
> Thread#2 - hot-removes memory
> __offline_pages
> start_isolate_page_range
> set_migratetype_isolate
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> // Set migration type to MIGRATE_ISOLATE
> drain_all_pages(zone);
> // drain per-cpu page lists to buddy allocator.
It is not really clear to me how we could have passed
has_unmovable_pages at this stage when the page is not PageBuddy. Is
this because you are using Movable Zones?
>
> Thread#1 - continue
> free_unref_page_commit
> migratetype = get_pcppage_migratetype(page);
> // get old migration type
> list_add(&page->lru, &pcp->lists[migratetype]);
> // add new page to already drained pcp list
>
> Thread#2
> Never drains pcp again, and therefore gets stuck in the loop.
>
> The fix is to try to drain per-cpu lists again after
> check_pages_isolated_cb() fails.
But this means that the page is not isolated and so it could be reused
for something else. No?
> Signed-off-by: Pavel Tatashin <[email protected]>
> Cc: [email protected]
> ---
> mm/memory_hotplug.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9d5ab5d3ca0..d6d54922bfce 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
> /* check again */
> ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> NULL, check_pages_isolated_cb);
> + /*
> + * per-cpu pages are drained in start_isolate_page_range, but if
> + * there are still pages that are not free, make sure that we
> + * drain again, because when we isolated range we might
> + * have raced with another thread that was adding pages to
> + * pcp list.
> + */
> + if (ret)
> + drain_all_pages(zone);
> } while (ret);
>
> /* Ok, all of our target is isolated.
> --
> 2.25.1
>
--
Michal Hocko
SUSE Labs
On Wed 02-09-20 16:01:17, Michal Hocko wrote:
> [Cc Mel and Vlastimil - I am still rummaging]
>
> On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
> > There is a race during page offline that can lead to an infinite loop:
> > a page never ends up on a buddy list and __offline_pages() keeps
> > retrying indefinitely, or until a termination signal is received.
> >
> > Thread#1 - a new process:
> >
> > load_elf_binary
> > begin_new_exec
> > exec_mmap
> > mmput
> > exit_mmap
> > tlb_finish_mmu
> > tlb_flush_mmu
> > release_pages
> > free_unref_page_list
> > free_unref_page_prepare
> > set_pcppage_migratetype(page, migratetype);
> > // Set page->index migration type below MIGRATE_PCPTYPES
> >
> > Thread#2 - hot-removes memory
> > __offline_pages
> > start_isolate_page_range
> > set_migratetype_isolate
> > set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> > // Set migration type to MIGRATE_ISOLATE
> > drain_all_pages(zone);
> > // drain per-cpu page lists to buddy allocator.
> >
> > Thread#1 - continue
> > free_unref_page_commit
> > migratetype = get_pcppage_migratetype(page);
> > // get old migration type
> > list_add(&page->lru, &pcp->lists[migratetype]);
> > // add new page to already drained pcp list
> >
> > Thread#2
> > Never drains pcp again, and therefore gets stuck in the loop.
> >
> > The fix is to try to drain per-cpu lists again after
> > check_pages_isolated_cb() fails.
Still trying to wrap my head around this, but I think this is not a
proper fix. It should be up to the page isolation to make sure no races are
possible with the page freeing path.
> > Signed-off-by: Pavel Tatashin <[email protected]>
> > Cc: [email protected]
> > ---
> > mm/memory_hotplug.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index e9d5ab5d3ca0..d6d54922bfce 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
> > /* check again */
> > ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > NULL, check_pages_isolated_cb);
> > + /*
> > + * per-cpu pages are drained in start_isolate_page_range, but if
> > + * there are still pages that are not free, make sure that we
> > + * drain again, because when we isolated range we might
> > + * have raced with another thread that was adding pages to
> > + * pcp list.
> > + */
> > + if (ret)
> > + drain_all_pages(zone);
> > } while (ret);
> >
> > /* Ok, all of our target is isolated.
> > --
> > 2.25.1
> >
>
> --
> Michal Hocko
> SUSE Labs
--
Michal Hocko
SUSE Labs
On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
> > There is a race during page offline that can lead to an infinite loop:
> > a page never ends up on a buddy list and __offline_pages() keeps
> > retrying indefinitely, or until a termination signal is received.
> >
> > Thread#1 - a new process:
> >
> > load_elf_binary
> > begin_new_exec
> > exec_mmap
> > mmput
> > exit_mmap
> > tlb_finish_mmu
> > tlb_flush_mmu
> > release_pages
> > free_unref_page_list
> > free_unref_page_prepare
> > set_pcppage_migratetype(page, migratetype);
> > // Set page->index migration type below MIGRATE_PCPTYPES
> >
> > Thread#2 - hot-removes memory
> > __offline_pages
> > start_isolate_page_range
> > set_migratetype_isolate
> > set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> > // Set migration type to MIGRATE_ISOLATE
> > drain_all_pages(zone);
> > // drain per-cpu page lists to buddy allocator.
>
> It is not really clear to me how we could have passed
> has_unmovable_pages at this stage when the page is not PageBuddy. Is
> this because you are using Movable Zones?
Yes, we hot-remove memory from the movable zone.
>
> >
> > Thread#1 - continue
> > free_unref_page_commit
> > migratetype = get_pcppage_migratetype(page);
> > // get old migration type
> > list_add(&page->lru, &pcp->lists[migratetype]);
> > // add new page to already drained pcp list
> >
> > Thread#2
> > Never drains pcp again, and therefore gets stuck in the loop.
> >
> > The fix is to try to drain per-cpu lists again after
> > check_pages_isolated_cb() fails.
>
> But this means that the page is not isolated and so it could be reused
> for something else. No?
The page is in a movable zone, has zero references, and the section is
isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE) is
set). The page should be offlinable, but it is lost in a pcp list as
that list is never drained again after the first failure to migrate
all pages in the range.
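For illustration, here is an abridged sketch of the commit path (simplified
from mm/page_alloc.c around v5.8): the isolation check runs on the
migratetype that was cached before the pageblock became MIGRATE_ISOLATE, so
the page still lands on a pcp list:

static void free_unref_page_commit(struct page *page, unsigned long pfn)
{
	struct zone *zone = page_zone(page);
	struct per_cpu_pages *pcp;
	/* Stale: cached by free_unref_page_prepare() before isolation. */
	int migratetype = get_pcppage_migratetype(page);

	if (migratetype >= MIGRATE_PCPTYPES) {
		/* Not taken here: the cached type predates MIGRATE_ISOLATE. */
		if (unlikely(is_migrate_isolate(migratetype))) {
			free_one_page(zone, page, pfn, 0, migratetype);
			return;
		}
		migratetype = MIGRATE_MOVABLE;
	}

	pcp = &this_cpu_ptr(zone->pageset)->pcp;
	/* Added to a list that was already drained by Thread#2. */
	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
}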
>
> > Signed-off-by: Pavel Tatashin <[email protected]>
> > Cc: [email protected]
> > ---
> > mm/memory_hotplug.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index e9d5ab5d3ca0..d6d54922bfce 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
> > /* check again */
> > ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > NULL, check_pages_isolated_cb);
> > + /*
> > + * per-cpu pages are drained in start_isolate_page_range, but if
> > + * there are still pages that are not free, make sure that we
> > + * drain again, because when we isolated range we might
> > + * have raced with another thread that was adding pages to
> > + * pcp list.
> > + */
> > + if (ret)
> > + drain_all_pages(zone);
> > } while (ret);
> >
> > /* Ok, all of our target is isolated.
> > --
> > 2.25.1
> >
>
> --
> Michal Hocko
> SUSE Labs
> > > The fix is to try to drain per-cpu lists again after
> > > check_pages_isolated_cb() fails.
>
> Still trying to wrap my head around this, but I think this is not a
> proper fix. It should be up to the page isolation to make sure no races are
> possible with the page freeing path.
>
As Bharata B Rao found in another thread, the problem was introduced
by this change:
c52e75935f8d: mm: remove extra drain pages on pcp list
So the drain used to be tried every time with lru_add_drain_all(),
which, I think, is excessive, as we start a thread per cpu to try to
drain and catch a rare race condition. With the proposed change we
drain again only when we find such a condition. Fixing it in
start_isolate_page_range() means that we must somehow synchronize it
with release_pages(), which adds cost to the runtime code instead of
to the hot-remove code.
Pasha
On 9/2/20 4:31 PM, Pavel Tatashin wrote:
>> > > The fix is to try to drain per-cpu lists again after
>> > > check_pages_isolated_cb() fails.
>>
>> Still trying to wrap my head around this, but I think this is not a
>> proper fix. It should be up to the page isolation to make sure no races are
>> possible with the page freeing path.
>>
>
> As Bharata B Rao found in another thread, the problem was introduced
> by this change:
> c52e75935f8d: mm: remove extra drain pages on pcp list
>
> So the drain used to be tried every time with lru_add_drain_all(),
> which, I think, is excessive, as we start a thread per cpu to try to
> drain and catch a rare race condition. With the proposed change we
> drain again only when we find such a condition. Fixing it in
> start_isolate_page_range() means that we must somehow synchronize it
> with release_pages(), which adds cost to the runtime code instead of
> to the hot-remove code.
Agreed. Isolation was always racy wrt freeing to pcplists, and it was simply
acceptable to do some extra drains if needed. Removing that race would indeed
be acceptable only if it didn't affect alloc/free fastpaths.
> Pasha
>
On 9/2/20 4:26 PM, Pavel Tatashin wrote:
> On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <[email protected]> wrote:
>>
>> >
>> > Thread#1 - continue
>> > free_unref_page_commit
>> > migratetype = get_pcppage_migratetype(page);
>> > // get old migration type
>> > list_add(&page->lru, &pcp->lists[migratetype]);
>> > // add new page to already drained pcp list
>> >
>> > Thread#2
>> > Never drains pcp again, and therefore gets stuck in the loop.
>> >
>> > The fix is to try to drain per-cpu lists again after
>> > check_pages_isolated_cb() fails.
>>
>> But this means that the page is not isolated and so it could be reused
>> for something else. No?
>
> The page is in a movable zone, has zero references, and the section is
> isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE) is
> set). The page should be offlinable, but it is lost in a pcp list as
> that list is never drained again after the first failure to migrate
> all pages in the range.
Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
could allocate it from the pcplist before we do the extra drain. But then it
becomes "visible again" and the loop in __offline_pages() should catch it by
scan_movable_pages() - do_migrate_range(). And this time the pageblock is
already marked as isolated, so the page (freed by migration) won't end up on the
pcplist again.
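For context, a heavily abridged sketch of that retry loop in
__offline_pages() (v5.8-era, simplified, with the proposed extra drain
folded in):

	do {
		pfn = start_pfn;
		/* Migrate away anything movable that is still in use. */
		if (!scan_movable_pages(pfn, end_pfn, &pfn))
			do_migrate_range(pfn, end_pfn);

		/* check again */
		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
					    NULL, check_pages_isolated_cb);
		/* Proposed fix: flush pcp lists again before retrying. */
		if (ret)
			drain_all_pages(zone);
	} while (ret);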
On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
> > On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <[email protected]> wrote:
> >>
> >> >
> >> > Thread#1 - continue
> >> > free_unref_page_commit
> >> > migratetype = get_pcppage_migratetype(page);
> >> > // get old migration type
> >> > list_add(&page->lru, &pcp->lists[migratetype]);
> >> > // add new page to already drained pcp list
> >> >
> >> > Thread#2
> >> > Never drains pcp again, and therefore gets stuck in the loop.
> >> >
> >> > The fix is to try to drain per-cpu lists again after
> >> > check_pages_isolated_cb() fails.
> >>
> >> But this means that the page is not isolated and so it could be reused
> >> for something else. No?
> >
> > The page is in a movable zone, has zero references, and the section is
> > isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE) is
> > set). The page should be offlinable, but it is lost in a pcp list as
> > that list is never drained again after the first failure to migrate
> > all pages in the range.
>
> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
> could allocate it from the pcplist before we do the extra drain. But then it
> becomes "visible again" and the loop in __offline_pages() should catch it by
> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
> already marked as isolated, so the page (freed by migration) won't end up on the
> pcplist again.
So the page block is marked MIGRATE_ISOLATE but the allocation itself
could be used for non migrateable objects. Or does anything prevent that
from happening?
We really do depend on isolation to not allow reuse when offlining.
--
Michal Hocko
SUSE Labs
> > >> But this means that the page is not isolated and so it could be reused
> > >> for something else. No?
> > >
> > > The page is in a movable zone, has zero references, and the section is
> > > isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE) is
> > > set). The page should be offlinable, but it is lost in a pcp list as
> > > that list is never drained again after the first failure to migrate
> > > all pages in the range.
> >
> > Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
> > could allocate it from the pcplist before we do the extra drain. But then it
> > becomes "visible again" and the loop in __offline_pages() should catch it by
> > scan_movable_pages() - do_migrate_range(). And this time the pageblock is
> > already marked as isolated, so the page (freed by migration) won't end up on the
> > pcplist again.
>
> So the page block is marked MIGRATE_ISOLATE but the allocation itself
> could be used for non migrateable objects. Or does anything prevent that
> from happening?
Vlastimil is right: we could allocate from the pcplist if someone
requests an allocation; nothing, from what I can tell, prevents that, and
we will immediately migrate that page in do_migrate_range().
> We really do depend on isolation to not allow reuse when offlining.
Once a page is isolated it is not re-used, but here this page is not
isolated because of the race between adding it to the pcp list and the
isolation. Draining a second time on a failure fixes the race.
Pasha
On 9/2/20 5:13 PM, Michal Hocko wrote:
> On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
>> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
>> > On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <[email protected]> wrote:
>> >>
>> >> >
>> >> > Thread#1 - continue
>> >> > free_unref_page_commit
>> >> > migratetype = get_pcppage_migratetype(page);
>> >> > // get old migration type
>> >> > list_add(&page->lru, &pcp->lists[migratetype]);
>> >> > // add new page to already drained pcp list
>> >> >
>> >> > Thread#2
>> >> > Never drains pcp again, and therefore gets stuck in the loop.
>> >> >
>> >> > The fix is to try to drain per-cpu lists again after
>> >> > check_pages_isolated_cb() fails.
>> >>
>> >> But this means that the page is not isolated and so it could be reused
>> >> for something else. No?
>> >
>> > The page is in a movable zone, has zero references, and the section is
>> > isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE) is
>> > set). The page should be offlinable, but it is lost in a pcp list as
>> > that list is never drained again after the first failure to migrate
>> > all pages in the range.
>>
>> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
>> could allocate it from the pcplist before we do the extra drain. But then it
>> becomes "visible again" and the loop in __offline_pages() should catch it by
>> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
>> already marked as isolated, so the page (freed by migration) won't end up on the
>> pcplist again.
>
> So the page block is marked MIGRATE_ISOLATE but the allocation itself
> could be used for non migrateable objects. Or does anything prevent that
> from happening?
In a movable zone, the allocation should not be used for non migrateable
objects. E.g. if the zone was not ZONE_MOVABLE, the offlining could fail
regardless of this race (analogously for migrating away from CMA pageblocks).
> We really do depend on isolation to not allow reuse when offlining.
This is not really different than if the page on pcplist was allocated just a
moment before the offlining, thus isolation started. We ultimately rely on being
able to migrate any allocated pages away during the isolation. This "freeing to
pcplists" race doesn't fundamentally change anything in this regard. We just
have to guarantee that pages on pcplists will be eventually flushed, to make
forward progress, and there was a bug in this aspect.
On Wed 02-09-20 19:51:45, Vlastimil Babka wrote:
> On 9/2/20 5:13 PM, Michal Hocko wrote:
> > On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
> >> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
> >> > On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <[email protected]> wrote:
> >> >>
> >> >> >
> >> >> > Thread#1 - continue
> >> >> > free_unref_page_commit
> >> >> > migratetype = get_pcppage_migratetype(page);
> >> >> > // get old migration type
> >> >> > list_add(&page->lru, &pcp->lists[migratetype]);
> >> >> > // add new page to already drained pcp list
> >> >> >
> >> >> > Thread#2
> >> >> > Never drains pcp again, and therefore gets stuck in the loop.
> >> >> >
> >> >> > The fix is to try to drain per-cpu lists again after
> >> >> > check_pages_isolated_cb() fails.
> >> >>
> >> >> But this means that the page is not isolated and so it could be reused
> >> >> for something else. No?
> >> >
> >> > The page is in a movable zone, has zero references, and the section is
> >> > isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE) is
> >> > set). The page should be offlinable, but it is lost in a pcp list as
> >> > that list is never drained again after the first failure to migrate
> >> > all pages in the range.
> >>
> >> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
> >> could allocate it from the pcplist before we do the extra drain. But then it
> >> becomes "visible again" and the loop in __offline_pages() should catch it by
> >> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
> >> already marked as isolated, so the page (freed by migration) won't end up on the
> >> pcplist again.
> >
> > So the page block is marked MIGRATE_ISOLATE but the allocation itself
> > could be used for non migrateable objects. Or does anything prevent that
> > from happening?
>
> In a movable zone, the allocation should not be used for non migrateable
> objects. E.g. if the zone was not ZONE_MOVABLE, the offlining could fail
> regardless of this race (analogously for migrating away from CMA pageblocks).
>
> > We really do depend on isolation to not allow reuse when offlining.
>
> This is not really different than if the page on pcplist was allocated just a
> moment before the offlining, thus isolation started. We ultimately rely on being
> able to migrate any allocated pages away during the isolation. This "freeing to
> pcplists" race doesn't fundamentally change anything in this regard. We just
> have to guarantee that pages on pcplists will be eventually flushed, to make
> forward progress, and there was a bug in this aspect.
You are right. I managed to confuse myself yesterday. The race is
impossible for !ZONE_MOVABLE because we do a PageBuddy check there. And on
the movable zone we are not losing the migrateability property.
Pavel, I think this will be useful information to add to the changelog.
We should also document this in the code to prevent further
confusion. I would suggest something like the following:
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 242c03121d73..56d4892bceb8 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
* pageblocks we may have modified and return -EBUSY to caller. This
* prevents two threads from simultaneously working on overlapping ranges.
*
+ * Please note that there is no strong synchronization with the page allocator
+ * either. Pages might be freed while their page blocks are marked ISOLATED.
+ * In some cases pages might still end up on pcp lists and that would allow
+ * for their allocation even when they are in fact isolated already. Depending on
+ * how strong of a guarantee the caller needs drain_all_pages might be needed
+ * (e.g. __offline_pages will need to call it after check for isolated range for
+ * a next retry).
+ *
* Return: the number of isolated pageblocks on success and -EBUSY if any part
* of range cannot be isolated.
*/
--
Michal Hocko
SUSE Labs
On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
[...]
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e9d5ab5d3ca0..d6d54922bfce 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
> /* check again */
> ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> NULL, check_pages_isolated_cb);
> + /*
> + * per-cpu pages are drained in start_isolate_page_range, but if
> + * there are still pages that are not free, make sure that we
> + * drain again, because when we isolated range we might
> + * have raced with another thread that was adding pages to
> + * pcp list.
I would also add
* Forward progress should be still guaranteed because
* pages on the pcp list can only belong to MOVABLE_ZONE
* because has_unmovable_pages explicitly checks for
* PageBuddy on freed pages on other zones.
> + */
> + if (ret)
> + drain_all_pages(zone);
> } while (ret);
>
> /* Ok, all of our target is isolated.
> --
> 2.25.1
>
--
Michal Hocko
SUSE Labs
On 9/1/20 2:46 PM, Pavel Tatashin wrote:
> There is a race during page offline that can lead to an infinite loop:
> a page never ends up on a buddy list and __offline_pages() keeps
> retrying indefinitely, or until a termination signal is received.
>
> Thread#1 - a new process:
>
> load_elf_binary
> begin_new_exec
> exec_mmap
> mmput
> exit_mmap
> tlb_finish_mmu
> tlb_flush_mmu
> release_pages
> free_unref_page_list
> free_unref_page_prepare
> set_pcppage_migratetype(page, migratetype);
> // Set page->index migration type below MIGRATE_PCPTYPES
>
> Thread#2 - hot-removes memory
> __offline_pages
> start_isolate_page_range
> set_migratetype_isolate
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> // Set migration type to MIGRATE_ISOLATE
> drain_all_pages(zone);
> // drain per-cpu page lists to buddy allocator.
>
> Thread#1 - continue
> free_unref_page_commit
> migratetype = get_pcppage_migratetype(page);
> // get old migration type
> list_add(&page->lru, &pcp->lists[migratetype]);
> // add new page to already drained pcp list
>
> Thread#2
> Never drains pcp again, and therefore gets stuck in the loop.
>
> The fix is to try to drain per-cpu lists again after
> check_pages_isolated_cb() fails.
>
> Signed-off-by: Pavel Tatashin <[email protected]>
> Cc: [email protected]
Fixes: ?
Acked-by: Vlastimil Babka <[email protected]>
Thanks.
Thanks Michal, I will add your comments.
Pasha
On Thu, Sep 3, 2020 at 3:07 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 01-09-20 08:46:15, Pavel Tatashin wrote:
> [...]
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index e9d5ab5d3ca0..d6d54922bfce 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1575,6 +1575,15 @@ static int __ref __offline_pages(unsigned long start_pfn,
> > /* check again */
> > ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > NULL, check_pages_isolated_cb);
> > + /*
> > + * per-cpu pages are drained in start_isolate_page_range, but if
> > + * there are still pages that are not free, make sure that we
> > + * drain again, because when we isolated range we might
> > + * have raced with another thread that was adding pages to
> > + * pcp list.
>
> I would also add
> * Forward progress should be still guaranteed because
> * pages on the pcp list can only belong to MOVABLE_ZONE
> * because has_unmovable_pages explicitly checks for
> * PageBuddy on freed pages on other zones.
> > + */
> > + if (ret)
> > + drain_all_pages(zone);
> > } while (ret);
> >
> > /* Ok, all of our target is isolated.
> > --
> > 2.25.1
> >
>
> --
> Michal Hocko
> SUSE Labs
On 03.09.20 08:38, Michal Hocko wrote:
> On Wed 02-09-20 19:51:45, Vlastimil Babka wrote:
>> On 9/2/20 5:13 PM, Michal Hocko wrote:
>>> On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
>>>> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
>>>>> On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thread#1 - continue
>>>>>>> free_unref_page_commit
>>>>>>> migratetype = get_pcppage_migratetype(page);
>>>>>>> // get old migration type
>>>>>>> list_add(&page->lru, &pcp->lists[migratetype]);
>>>>>>> // add new page to already drained pcp list
>>>>>>>
>>>>>>> Thread#2
>>>>>>> Never drains pcp again, and therefore gets stuck in the loop.
>>>>>>>
>>>>>>> The fix is to try to drain per-cpu lists again after
>>>>>>> check_pages_isolated_cb() fails.
>>>>>>
>>>>>> But this means that the page is not isolated and so it could be reused
>>>>>> for something else. No?
>>>>>
>>>>> The page is in a movable zone, has zero references, and the section is
>>>>> isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE) is
>>>>> set). The page should be offlinable, but it is lost in a pcp list as
>>>>> that list is never drained again after the first failure to migrate
>>>>> all pages in the range.
>>>>
>>>> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
>>>> could allocate it from the pcplist before we do the extra drain. But then it
>>>> becomes "visible again" and the loop in __offline_pages() should catch it by
>>>> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
>>>> already marked as isolated, so the page (freed by migration) won't end up on the
>>>> pcplist again.
>>>
>>> So the page block is marked MIGRATE_ISOLATE but the allocation itself
>>> could be used for non migrateable objects. Or does anything prevent that
>>> from happening?
>>
>> In a movable zone, the allocation should not be used for non migrateable
>> objects. E.g. if the zone was not ZONE_MOVABLE, the offlining could fail
>> regardless of this race (analogously for migrating away from CMA pageblocks).
>>
>>> We really do depend on isolation to not allow reuse when offlining.
>>
>> This is not really different than if the page on pcplist was allocated just a
>> moment before the offlining, thus isolation started. We ultimately rely on being
>> able to migrate any allocated pages away during the isolation. This "freeing to
>> pcplists" race doesn't fundamentally change anything in this regard. We just
>> have to guarantee that pages on pcplists will be eventually flushed, to make
>> forward progress, and there was a bug in this aspect.
>
> You are right. I managed to confuse myself yesterday. The race is
> impossible for !ZONE_MOVABLE because we do a PageBuddy check there. And on
> the movable zone we are not losing the migrateability property.
>
> Pavel, I think this will be useful information to add to the changelog.
> We should also document this in the code to prevent further
> confusion. I would suggest something like the following:
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 242c03121d73..56d4892bceb8 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> * pageblocks we may have modified and return -EBUSY to caller. This
> * prevents two threads from simultaneously working on overlapping ranges.
> *
> + * Please note that there is no strong synchronization with the page allocator
> + * either. Pages might be freed while their page blocks are marked ISOLATED.
> + * In some cases pages might still end up on pcp lists and that would allow
> + * for their allocation even when they are in fact isolated already. Depending on
> + * how strong of a guarantee the caller needs drain_all_pages might be needed
> + * (e.g. __offline_pages will need to call it after check for isolated range for
> + * a next retry).
> + *
As expressed in reply to v2, I dislike this hack. There is strong
synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
just plain ugly.
Can't we temporarily disable PCP (while some pageblock in the zone is
isolated, which we know e.g., due to the counter), so no new pages get
put into PCP lists after draining, and re-enable after no pageblocks are
isolated again? We keep draining the PCP, so it doesn't seem to be of a
lot of use during that period, no? It's a performance hit already.
Then, we would only need exactly one drain. And we would only have to
check on the free path whether PCP is temporarily disabled.
--
Thanks,
David / dhildenb
On Thu, Sep 3, 2020 at 2:20 PM David Hildenbrand <[email protected]> wrote:
>
> On 03.09.20 08:38, Michal Hocko wrote:
> > On Wed 02-09-20 19:51:45, Vlastimil Babka wrote:
> >> On 9/2/20 5:13 PM, Michal Hocko wrote:
> >>> On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
> >>>> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
> >>>>> On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <[email protected]> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> Thread#1 - continue
> >>>>>>> free_unref_page_commit
> >>>>>>> migratetype = get_pcppage_migratetype(page);
> >>>>>>> // get old migration type
> >>>>>>> list_add(&page->lru, &pcp->lists[migratetype]);
> >>>>>>> // add new page to already drained pcp list
> >>>>>>>
> >>>>>>> Thread#2
> >>>>>>> Never drains pcp again, and therefore gets stuck in the loop.
> >>>>>>>
> >>>>>>> The fix is to try to drain per-cpu lists again after
> >>>>>>> check_pages_isolated_cb() fails.
> >>>>>>
> >>>>>> But this means that the page is not isolated and so it could be reused
> >>>>>> for something else. No?
> >>>>>
> >>>>> The page is in a movable zone, has zero references, and the section is
> >>>>> isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE);) is
> >>>>> set. The page should be offlinable, but it is lost in a pcp list as
> >>>>> that list is never drained again after the first failure to migrate
> >>>>> all pages in the range.
> >>>>
> >>>> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
> >>>> could allocate it from the pcplist before we do the extra drain. But then it
> >>>> becomes "visible again" and the loop in __offline_pages() should catch it by
> >>>> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
> >>>> already marked as isolated, so the page (freed by migration) won't end up on the
> >>>> pcplist again.
> >>>
> >>> So the page block is marked MIGRATE_ISOLATE but the allocation itself
> >>> could be used for non migrateable objects. Or does anything prevent that
> >>> from happening?
> >>
> >> In a movable zone, the allocation should not be used for non migrateable
> >> objects. E.g. if the zone was not ZONE_MOVABLE, the offlining could fail
> >> regardless of this race (analogously for migrating away from CMA pageblocks).
> >>
> >>> We really do depend on isolation to not allow reuse when offlining.
> >>
> >> This is not really different than if the page on pcplist was allocated just a
> >> moment before the offlining, thus isolation started. We ultimately rely on being
> >> able to migrate any allocated pages away during the isolation. This "freeing to
> >> pcplists" race doesn't fundamentally change anything in this regard. We just
> >> have to guarantee that pages on pcplists will be eventually flushed, to make
> >> forward progress, and there was a bug in this aspect.
> >
> > You are right. I managed to confuse myself yesterday. The race is
> > impossible for !ZONE_MOVABLE because we do a PageBuddy check there. And on
> > the movable zone we are not losing the migrateability property.
> >
> > Pavel, I think this will be useful information to add to the changelog.
> > We should also document this in the code to prevent further
> > confusion. I would suggest something like the following:
> >
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index 242c03121d73..56d4892bceb8 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> > * pageblocks we may have modified and return -EBUSY to caller. This
> > * prevents two threads from simultaneously working on overlapping ranges.
> > *
> > + * Please note that there is no strong synchronization with the page allocator
> > + * either. Pages might be freed while their page blocks are marked ISOLATED.
> > + * In some cases pages might still end up on pcp lists and that would allow
> > + * for their allocation even when they are in fact isolated already. Depending on
> > + * how strong of a guarantee the caller needs drain_all_pages might be needed
> > + * (e.g. __offline_pages will need to call it after check for isolated range for
> > + * a next retry).
> > + *
>
> As expressed in reply to v2, I dislike this hack. There is strong
> synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
> just plain ugly.
>
> Can't we temporarily disable PCP (while some pageblock in the zone is
> isolated, which we know e.g., due to the counter), so no new pages get
> put into PCP lists after draining, and re-enable after no pageblocks are
> isolated again? We keep draining the PCP, so it doesn't seem to be of a
> lot of use during that period, no? It's a performance hit already.
>
> Then, we would only need exactly one drain. And we would only have to
> check on the free path whether PCP is temporarily disabled.
Hm, we could use a static branch to disable it; that would keep
the release code just as fast, but I am worried it will make the code even
uglier. Let's see what others in this thread think about this idea.
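For illustration, a minimal sketch of the static-branch idea (the key name
and the wrapper helpers are hypothetical, not existing kernel code):

/* Hypothetical key; a single NOP on the free fast path while disabled. */
DEFINE_STATIC_KEY_FALSE(pcp_bypass);

static void free_unref_page_commit(struct page *page, unsigned long pfn)
{
	if (static_branch_unlikely(&pcp_bypass)) {
		/* Bypass pcp lists entirely while isolation is active. */
		free_one_page(page_zone(page), page, pfn, 0,
			      get_pcppage_migratetype(page));
		return;
	}
	/* ... normal pcp list handling ... */
}

/* Hypothetical: bracket the isolated period; a single drain then suffices. */
static void zone_pcp_bypass_begin(struct zone *zone)
{
	static_branch_enable(&pcp_bypass);
	drain_all_pages(zone);	/* nothing lands on pcp lists after this */
}

static void zone_pcp_bypass_end(struct zone *zone)
{
	static_branch_disable(&pcp_bypass);
}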
Thank you,
Pasha
On 03.09.20 20:23, Pavel Tatashin wrote:
> On Thu, Sep 3, 2020 at 2:20 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 03.09.20 08:38, Michal Hocko wrote:
>>> On Wed 02-09-20 19:51:45, Vlastimil Babka wrote:
>>>> On 9/2/20 5:13 PM, Michal Hocko wrote:
>>>>> On Wed 02-09-20 16:55:05, Vlastimil Babka wrote:
>>>>>> On 9/2/20 4:26 PM, Pavel Tatashin wrote:
>>>>>>> On Wed, Sep 2, 2020 at 10:08 AM Michal Hocko <[email protected]> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thread#1 - continue
>>>>>>>>> free_unref_page_commit
>>>>>>>>> migratetype = get_pcppage_migratetype(page);
>>>>>>>>> // get old migration type
>>>>>>>>> list_add(&page->lru, &pcp->lists[migratetype]);
>>>>>>>>> // add new page to already drained pcp list
>>>>>>>>>
>>>>>>>>> Thread#2
>>>>>>>>> Never drains pcp again, and therefore gets stuck in the loop.
>>>>>>>>>
>>>>>>>>> The fix is to try to drain per-cpu lists again after
>>>>>>>>> check_pages_isolated_cb() fails.
>>>>>>>>
>>>>>>>> But this means that the page is not isolated and so it could be reused
>>>>>>>> for something else. No?
>>>>>>>
>>>>>>> The page is in a movable zone, has zero references, and the section is
>>>>>>> isolated (i.e. set_pageblock_migratetype(page, MIGRATE_ISOLATE) is
>>>>>>> set). The page should be offlinable, but it is lost in a pcp list as
>>>>>>> that list is never drained again after the first failure to migrate
>>>>>>> all pages in the range.
>>>>>>
>>>>>> Yeah. To answer Michal's "it could be reused for something else" - yes, somebody
>>>>>> could allocate it from the pcplist before we do the extra drain. But then it
>>>>>> becomes "visible again" and the loop in __offline_pages() should catch it by
>>>>>> scan_movable_pages() - do_migrate_range(). And this time the pageblock is
>>>>>> already marked as isolated, so the page (freed by migration) won't end up on the
>>>>>> pcplist again.
>>>>>
>>>>> So the page block is marked MIGRATE_ISOLATE but the allocation itself
>>>>> could be used for non migrateable objects. Or does anything prevent that
>>>>> from happening?
>>>>
>>>> In a movable zone, the allocation should not be used for non migrateable
>>>> objects. E.g. if the zone was not ZONE_MOVABLE, the offlining could fail
>>>> regardless of this race (analogously for migrating away from CMA pageblocks).
>>>>
>>>>> We really do depend on isolation to not allow reuse when offlining.
>>>>
>>>> This is not really different than if the page on pcplist was allocated just a
>>>> moment before the offlining, thus isolation started. We ultimately rely on being
>>>> able to migrate any allocated pages away during the isolation. This "freeing to
>>>> pcplists" race doesn't fundamentally change anything in this regard. We just
>>>> have to guarantee that pages on pcplists will be eventually flushed, to make
>>>> forward progress, and there was a bug in this aspect.
>>>
>>> You are right. I managed to confuse myself yesterday. The race is
>>> impossible for !ZONE_MOVABLE because we do a PageBuddy check there. And on
>>> the movable zone we are not losing the migrateability property.
>>>
>>> Pavel, I think this will be useful information to add to the changelog.
>>> We should also document this in the code to prevent further
>>> confusion. I would suggest something like the following:
>>>
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index 242c03121d73..56d4892bceb8 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>>> * pageblocks we may have modified and return -EBUSY to caller. This
>>> * prevents two threads from simultaneously working on overlapping ranges.
>>> *
>>> + * Please note that there is no strong synchronization with the page allocator
>>> + * either. Pages might be freed while their page blocks are marked ISOLATED.
>>> + * In some cases pages might still end up on pcp lists and that would allow
>>> + * for their allocation even when they are in fact isolated already. Depending on
>>> + * how strong of a guarantee the caller needs drain_all_pages might be needed
>>> + * (e.g. __offline_pages will need to call it after check for isolated range for
>>> + * a next retry).
>>> + *
>>
>> As expressed in reply to v2, I dislike this hack. There is strong
>> synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
>> just plain ugly.
>>
>> Can't we temporarily disable PCP (while some pageblock in the zone is
>> isolated, which we know e.g., due to the counter), so no new pages get
>> put into PCP lists after draining, and re-enable after no pageblocks are
>> isolated again? We keep draining the PCP, so it doesn't seem to be of a
>> lot of use during that period, no? It's a performance hit already.
>>
>> Then, we would only need exactly one drain. And we would only have to
>> check on the free path whether PCP is temporarily disabled.
>
> Hm, we could use a static branch to disable it; that would keep
> the release code just as fast, but I am worried it will make the code even
> uglier. Let's see what others in this thread think about this idea.
It would at least stop this "allocate from MIGRATE_ISOLATE" behavior
and the "drain when you feel like it, and drain more frequently to work
around broken PCP code" handling.
Anyhow, I'll be offline until next Tuesday, cheers!
--
Thanks,
David / dhildenb
On 9/3/20 8:23 PM, Pavel Tatashin wrote:
>>
>> As expressed in reply to v2, I dislike this hack. There is strong
>> synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
>> just plain ugly.
>>
>> Can't we temporarily disable PCP (while some pageblock in the zone is
>> isolated, which we know e.g., due to the counter), so no new pages get
>> put into PCP lists after draining, and re-enable after no pageblocks are
>> isolated again? We keep draining the PCP, so it doesn't seem to be of a
>> lot of use during that period, no? It's a performance hit already.
>>
>> Then, we would only need exactly one drain. And we would only have to
>> check on the free path whether PCP is temporarily disabled.
>
> Hm, we could use a static branch to disable it; that would keep
> the release code just as fast, but I am worried it will make the code even
> uglier. Let's see what others in this thread think about this idea.
Maybe we could just set pcp->high = 0 or something, and make sure the
pcplist user only reads this value while irqs are disabled. Then the
IPI in drain_all_pages() should guarantee there's nobody racing to free
to the pcplist. But be careful not to introduce bugs like the one fixed with
[1]. And I'm not sure this guarantee survives when RT comes and replaces
the disabled irqs with local_lock or something.
[1]
https://lore.kernel.org/linux-mm/[email protected]/
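For illustration, a rough sketch of that idea (hypothetical helper; assumes
the free path samples pcp->high with irqs disabled, so the drain below acts
as the synchronization point):

/* Hypothetical: make every subsequent pcp free spill straight to buddy. */
static void zone_pcp_disable_sketch(struct zone *zone)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct per_cpu_pages *pcp =
			&per_cpu_ptr(zone->pageset, cpu)->pcp;

		pcp->high = 0;	/* pcp->count >= pcp->high now always spills */
		pcp->batch = 1;
	}
	/* Flush whatever was queued before the update became visible. */
	drain_all_pages(zone);
}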
> Thank you,
> Pasha
>
On Thu 03-09-20 20:31:04, David Hildenbrand wrote:
> On 03.09.20 20:23, Pavel Tatashin wrote:
> > On Thu, Sep 3, 2020 at 2:20 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 03.09.20 08:38, Michal Hocko wrote:
[...]
> >>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> >>> index 242c03121d73..56d4892bceb8 100644
> >>> --- a/mm/page_isolation.c
> >>> +++ b/mm/page_isolation.c
> >>> @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> >>> * pageblocks we may have modified and return -EBUSY to caller. This
> >>> * prevents two threads from simultaneously working on overlapping ranges.
> >>> *
> >>> + * Please note that there is no strong synchronization with the page allocator
> >>> + * either. Pages might be freed while their page blocks are marked ISOLATED.
> >>> + * In some cases pages might still end up on pcp lists and that would allow
> >>> + * for their allocation even when they are in fact isolated already. Depending on
> >>> + * how strong of a guarantee the caller needs drain_all_pages might be needed
> >>> + * (e.g. __offline_pages will need to call it after check for isolated range for
> >>> + * a next retry).
> >>> + *
> >>
> >> As expressed in reply to v2, I dislike this hack. There is strong
> >> synchronization, just PCP is special. Allocating from MIGRATE_ISOLATE is
> >> just plain ugly.
Completely agreed! I am not happy about that either. But I believe this
hack is the easiest way forward for stable trees and as an immediate
fix. We can build on top of that of course.
> >> Can't we temporarily disable PCP (while some pageblock in the zone is
> >> isolated, which we know e.g., due to the counter), so no new pages get
> >> put into PCP lists after draining, and re-enable after no pageblocks are
> >> isolated again? We keep draining the PCP, so it doesn't seem to be of a
> >> lot of use during that period, no? It's a performance hit already.
This is a good point.
> >> Then, we would only need exactly one drain. And we would only have to
> >> check on the free path whether PCP is temporarily disabled.
> >
> > Hm, we could use a static branch to disable it; that would keep
> > the release code just as fast, but I am worried it will make the code even
> > uglier. Let's see what others in this thread think about this idea.
I know that static branches are a very effective way to enable/disable
features, but I have no experience with how they perform for a very
short-lived use. Maybe that is just fine for a single place which needs
to be patched. This would especially be a problem if the static branch
is to be enabled from start_isolate_page_range, because that includes all
cma allocator users.
Another alternative would be to enable/disable the static branch only from
users who really care, but this is quite tricky: how do you tell whether
you need it or not? It seems that alloc_contig_range would be just fine
with a weaker semantic, because it would "only" lead to a spurious failure.
Memory hotplug on the other hand really needs to have a point where
nobody interferes with the offlined memory so it could ask for a
stronger semantic.
Yet another option would be to make draining stronger and actually
guarantee there are no in-flight pages to be freed to the pcp list.
One way would be to tweak pcp->high and implement a strong barrier
(IPI?) to sync with all CPUs. Quite expensive, especially when there are
many draining requests (read cma users because hotplug doesn't really
matter much as it happens seldom).
So no nice&cheap solution I can think of...
--
Michal Hocko
SUSE Labs
> Another alternative would be to enable/disable the static branch only from
> users who really care, but this is quite tricky: how do you tell whether
> you need it or not? It seems that alloc_contig_range would be just fine
> with a weaker semantic, because it would "only" lead to a spurious failure.
> Memory hotplug on the other hand really needs to have a point where
> nobody interferes with the offlined memory so it could ask for a
> stronger semantic.
>
> Yet another option would be to make draining stronger and actually
> guarantee there are no in-flight pages to be freed to the pcp list.
> One way would be to tweak pcp->high and implement a strong barrier
> (IPI?) to sync with all CPUs. Quite expensive, especially when there are
> many draining requests (read cma users because hotplug doesn't really
> matter much as it happens seldom).
>
> So no nice&cheap solution I can think of...
I think start_isolate_page_range() should not be doing page draining
at all. It should isolate ranges, meaning set appropriate flags, but
draining should be performed by the users when appropriate: next to
lru_add_drain_all() calls both in CMA and hotplug.
Currently, the way start_isolate_page_range() drains pages is very
inefficient. It calls drain_all_pages() for every valid page block,
which is a slow call as it starts a thread per cpu, and waits for
those threads to finish before returning.
We could optimize by moving the drain_all_pages() calls from
set_migratetype_isolate() to start_isolate_page_range() and calling it
once for every different zone, but both current users of this
interface guarantee that all pfns [start_pfn, end_pfn] are within the
same zone, and I think we should keep it this way, so again the extra
traversal would be unnecessary overhead.
This way we will have on average only a single drain per hot-remove
(instead of one per block), and the draining is also going to be done
symmetrically, in one place. Faster hot-remove and cma alloc, and no
race; imo a win-win.
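For illustration, a sketch of that restructuring (hypothetical, undo path
elided; assumes, as both current callers guarantee, that the whole range is
within one zone):

/* set_migratetype_isolate() itself would no longer call drain_all_pages(). */
int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
			     unsigned migratetype, int flags)
{
	unsigned long pfn;
	struct page *page;

	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
		page = __first_valid_page(pfn, pageblock_nr_pages);
		if (page && set_migratetype_isolate(page, migratetype, flags))
			return -EBUSY;	/* real code would undo prior blocks */
	}
	/* One drain for the whole single-zone range, not one per block. */
	page = __first_valid_page(start_pfn, pageblock_nr_pages);
	if (page)
		drain_all_pages(page_zone(page));
	return 0;
}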
Pasha
On Fri 04-09-20 10:25:02, Pavel Tatashin wrote:
> > Another alternative would be to enable/disable the static branch only from
> > users who really care, but this is quite tricky: how do you tell whether
> > you need it or not? It seems that alloc_contig_range would be just fine
> > with a weaker semantic, because it would "only" lead to a spurious failure.
> > Memory hotplug on the other hand really needs to have a point where
> > nobody interferes with the offlined memory so it could ask for a
> > stronger semantic.
> >
> > Yet another option would be to make draining stronger and actually
> > guarantee there are no in-flight pages to be freed to the pcp list.
> > One way would be to tweak pcp->high and implement a strong barrier
> > (IPI?) to sync with all CPUs. Quite expensive, especially when there are
> > many draining requests (read cma users because hotplug doesn't really
> > matter much as it happens seldom).
> >
> > So no nice&cheap solution I can think of...
>
> I think start_isolate_page_range() should not be doing page draining
> at all. It should isolate ranges, meaning set appropriate flags, but
> draining should be performed by the users when appropriate: next to
> lru_add_drain_all() calls both in CMA and hotplug.
I disagree. The pcp draining is an implementation detail, and we
shouldn't burden callers with having to be aware of it.
> Currently, the way start_isolate_page_range() drains pages is very
> inefficient. It calls drain_all_pages() for every valid page block,
> which is a slow call as it starts a thread per cpu, and waits for
> those threads to finish before returning.
This is an implementation detail.
> We could optimize by moving the drain_all_pages() calls from
> set_migratetype_isolate() to start_isolate_page_range() and calling it
> once for every different zone, but both current users of this
> interface guarantee that all pfns [start_pfn, end_pfn] are within the
> same zone, and I think we should keep it this way, so again the extra
> traversal would be unnecessary overhead.
Again, this just leads to tricky code. Just look at how easy it was to
break this by removing something that clearly looked like a duplicate call.
It is true that memory isolation usage is limited to only a few use cases,
but I would strongly prefer to make the semantic clear so that we do not
repeat these regressions.
--
Michal Hocko
SUSE Labs