2021-02-17 13:20:39

by Oscar Salvador

Subject: [PATCH 0/2] Make alloc_contig_range handle Hugetlb pages

Hi,

this is the new version of [1].

RFC -> v1:
- Drop RFC
- Addressed feedback from David and Mike
- Fence off gigantic pages as there is a cyclic dependency between
them and alloc_contig_range
- Re-organize the code to make race-window smaller and to put
all details in hugetlb code
- Drop nodemask initialization. First a node will be tried and then we
 will fall back to other nodes containing memory (N_MEMORY). Details in
 patch#1's changelog
- Count new page as surplus in case we failed to dissolve the old page
and the new one. Details in patch#1.

I already have a patch that uses the new alloc_and_dissolve() from hotplug
code, but I decided to leave that patch out until this patchset is settled.

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=429833

Oscar Salvador (2):
mm: Make alloc_contig_range handle free hugetlb pages
mm: Make alloc_contig_range handle in-use hugetlb pages

include/linux/hugetlb.h | 7 ++++
mm/compaction.c | 21 +++++++++++
mm/hugetlb.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 5 +--
4 files changed, 124 insertions(+), 2 deletions(-)

--
2.16.3


2021-02-17 13:20:41

by Oscar Salvador

Subject: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

Free hugetlb pages are tricky to handle: so that no userspace application
notices disruption, we need to replace the current free hugepage with
a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to get a new fresh hugetlb page, and if it
succeeds, it will dissolve the old one.

With regard to the allocation, since we do not know whether the old page
was allocated on a specific node by request, the node the old page belongs
to will be tried first, and then we will fall back to all nodes containing
memory (N_MEMORY).

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.

Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/hugetlb.h | 6 ++++
mm/compaction.c | 12 +++++++
mm/hugetlb.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 109 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..72352d718829 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -505,6 +505,7 @@ struct huge_bootmem_page {
 	struct hstate *hstate;
 };
 
+bool isolate_or_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 			     unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -775,6 +776,11 @@ void set_page_huge_active(struct page *page);
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline bool isolate_or_dissolve_huge_page(struct page *page)
+{
+	return false;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
 					   unsigned long addr,
 					   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index 190ccdaa6c19..d52506ed9db7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -905,6 +905,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			valid_page = page;
 		}
 
+		if (PageHuge(page) && cc->alloc_contig) {
+			if (!isolate_or_dissolve_huge_page(page))
+				goto isolate_fail;
+
+			/*
+			 * Ok, the hugepage was dissolved. Now these pages are
+			 * Buddy and cannot be re-allocated because they are
+			 * isolated. Fall-through as the check below handles
+			 * Buddy pages.
+			 */
+		}
+
 		/*
 		 * Skip if free. We read page order here without zone lock
 		 * which is generally unsafe, but the race window is small and
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..b78926bca60a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2294,6 +2294,97 @@ static void restore_reserve_on_error(struct hstate *h,
 	}
 }
 
+static bool alloc_and_dissolve_huge_page(struct hstate *h, struct page *page)
+{
+	gfp_t gfp_mask = htlb_alloc_mask(h);
+	nodemask_t *nmask = &node_states[N_MEMORY];
+	struct page *new_page;
+	bool ret = false;
+	int nid;
+
+	spin_lock(&hugetlb_lock);
+	/*
+	 * Check one more time to make race-window smaller.
+	 */
+	if (!PageHuge(page)) {
+		/*
+		 * Dissolved from under our feet.
+		 */
+		spin_unlock(&hugetlb_lock);
+		return true;
+	}
+
+	nid = page_to_nid(page);
+	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * Before dissolving the page, we need to allocate a new one,
+	 * so the pool remains stable.
+	 */
+	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
+	if (new_page) {
+		/*
+		 * Ok, we got a new free hugepage to replace this one. Try to
+		 * dissolve the old page.
+		 */
+		if (!dissolve_free_huge_page(page)) {
+			ret = true;
+		} else if (dissolve_free_huge_page(new_page)) {
+			/*
+			 * Seems the old page could not be dissolved, so try to
+			 * dissolve the freshly allocated page. If that fails
+			 * too, let us count the new page as a surplus. Doing so
+			 * allows the pool to be re-balanced when pages are freed
+			 * instead of enqueued again.
+			 */
+			spin_lock(&hugetlb_lock);
+			h->surplus_huge_pages++;
+			h->surplus_huge_pages_node[nid]++;
+			spin_unlock(&hugetlb_lock);
+		}
+		/*
+		 * Free it into the hugepage allocator
+		 */
+		put_page(new_page);
+	}
+
+	return ret;
+}
+
+bool isolate_or_dissolve_huge_page(struct page *page)
+{
+	struct hstate *h = NULL;
+	struct page *head;
+	bool ret = false;
+
+	spin_lock(&hugetlb_lock);
+	if (PageHuge(page)) {
+		head = compound_head(page);
+		h = page_hstate(head);
+	}
+	spin_unlock(&hugetlb_lock);
+
+	if (!h)
+		/*
+		 * The page might have been dissolved from under our feet.
+		 * If that is the case, return success as if we dissolved it
+		 * ourselves.
+		 */
+		return true;
+
+	if (hstate_is_gigantic(h))
+		/*
+		 * Fence off gigantic pages as there is a cyclic dependency
+		 * between alloc_contig_range and them.
+		 */
+		return ret;
+
+	if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
+		ret = true;
+
+	return ret;
+}
+
 struct page *alloc_huge_page(struct vm_area_struct *vma,
 			     unsigned long addr, int avoid_reserve)
 {
--
2.16.3

2021-02-17 13:49:35

by David Hildenbrand

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On 17.02.21 14:30, Michal Hocko wrote:
> On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
>> Free hugetlb pages are tricky to handle: so that no userspace application
>> notices disruption, we need to replace the current free hugepage with
>> a new one.
>>
>> In order to do that, a new function called alloc_and_dissolve_huge_page
>> is introduced.
>> This function will first try to get a new fresh hugetlb page, and if it
>> succeeds, it will dissolve the old one.
>>
>> With regard to the allocation, since we do not know whether the old page
>> was allocated on a specific node by request, the node the old page belongs
>> to will be tried first, and then we will fall back to all nodes containing
>> memory (N_MEMORY).
>
> I do not think falling back to a different zone is ok. If yes, then this
> really requires very good reasoning. alloc_contig_range is an
> optimistic allocation interface at best, and it shouldn't break carefully
> node-aware preallocation done by the administrator.

What does memory offlining do when migrating in-use hugetlbfs pages?
Does it always keep the node?

I think keeping the node is the easiest/simplest approach for now.

>
>> Note that gigantic hugetlb pages are fenced off since there is a cyclic
>> dependency between them and alloc_contig_range.
>
> Why do we need/want to do all this in the first place?

cma and virtio-mem (especially on ZONE_MOVABLE) really want to handle
hugetlbfs pages.

--
Thanks,

David / dhildenb

2021-02-17 13:50:54

by Oscar Salvador

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Wed, Feb 17, 2021 at 02:30:43PM +0100, Michal Hocko wrote:
> On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
> I do not think falling back to a different zone is ok. If yes, then this
> really requires very good reasoning. alloc_contig_range is an
> optimistic allocation interface at best, and it shouldn't break carefully
> node-aware preallocation done by the administrator.

Yeah, the previous version (RFC) was more careful with that.
I somehow thought that it might be ok to fall back to other nodes in case
we failed to allocate on the preferred nid.

I will bring the RFC handling back wrt. the allocation once I gather more
feedback.

>
> > Note that gigantic hugetlb pages are fenced off since there is a cyclic
> > dependency between them and alloc_contig_range.
>
> Why do we need/want to do all this in the first place?

When trying to allocate a memory chunk with alloc_contig_range, the call will
fail if it ever sees a HugeTLB page, because isolate_migratepages_range() does
not "recognize" those for migration (or, to put it differently, does not know
about them).

Given that HugeTLB pages can be migrated, or dissolved if free, it makes sense
to enable isolate_migratepages_range() to recognize HugeTLB pages in order
to handle them, as it currently does with LRU and Movable pages.

--
Oscar Salvador
SUSE L3

2021-02-17 13:55:28

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Wed 17-02-21 14:36:47, David Hildenbrand wrote:
> On 17.02.21 14:30, Michal Hocko wrote:
> > On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
> > > Free hugetlb pages are tricky to handle: so that no userspace application
> > > notices disruption, we need to replace the current free hugepage with
> > > a new one.
> > >
> > > In order to do that, a new function called alloc_and_dissolve_huge_page
> > > is introduced.
> > > This function will first try to get a new fresh hugetlb page, and if it
> > > succeeds, it will dissolve the old one.
> > >
> > > With regard to the allocation, since we do not know whether the old page
> > > was allocated on a specific node by request, the node the old page belongs
> > > to will be tried first, and then we will fall back to all nodes containing
> > > memory (N_MEMORY).
> >
> > I do not think falling back to a different zone is ok. If yes, then this
> > really requires very good reasoning. alloc_contig_range is an
> > optimistic allocation interface at best, and it shouldn't break carefully
> > node-aware preallocation done by the administrator.
>
> What does memory offlining do when migrating in-use hugetlbfs pages? Does it
> always keep the node?

No, it will break the node pool. The reasoning behind that is that
offlining is an explicit request from userspace and it is expected
to break affinities, because it is a destructive action from the memory
capacity point of view. It is impossible to keep the former affinity while
you are cutting the memory off under its user.

> I think keeping the node is the easiest/simplest approach for now.
>
> >
> > > Note that gigantic hugetlb pages are fenced off since there is a cyclic
> > > dependency between them and alloc_contig_range.
> >
> > Why do we need/want to do all this in the first place?
>
> cma and virtio-mem (especially on ZONE_MOVABLE) really want to handle
> hugetlbfs pages.

Do we have any real-life examples? Or does this fall more into the
"let's optimize an existing implementation" category?
--
Michal Hocko
SUSE Labs

2021-02-17 14:02:02

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Wed 17-02-21 14:53:37, David Hildenbrand wrote:
> On 17.02.21 14:50, Michal Hocko wrote:
[...]
> > Do we have any real-life examples? Or does this fall more into the
> > "let's optimize an existing implementation" category?
> >
>
> It's a big TODO item I have on my list and I am happy that Oscar is looking
> into it. So yes, I noticed it while working on virtio-mem. It's real.

Do not take me wrong, I am not opposing the functionality. I am
asking for the specific use case.

--
Michal Hocko
SUSE Labs

2021-02-17 14:15:02

by David Hildenbrand

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On 17.02.21 14:59, Michal Hocko wrote:
> On Wed 17-02-21 14:53:37, David Hildenbrand wrote:
>> On 17.02.21 14:50, Michal Hocko wrote:
> [...]
>>> Do we have any real-life examples? Or does this fall more into the
>>> "let's optimize an existing implementation" category?
>>>
>>
>> It's a big TODO item I have on my list and I am happy that Oscar is looking
>> into it. So yes, I noticed it while working on virtio-mem. It's real.
>
> Do not take me wrong, I am not opposing the functionality. I am
> asking for the specific use case.

Makes sense, and a proper motivation should be included in the
patches/cover letter. So here comes a quick-n-dirty example:


Start a VM with 4G. Hotplug 1G via virtio-mem and online it to
ZONE_MOVABLE. Allocate 512 huge pages.

[root@localhost ~]# cat /proc/meminfo
MemTotal: 5061512 kB
MemFree: 3319396 kB
MemAvailable: 3457144 kB
...
HugePages_Total: 512
HugePages_Free: 512
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB


The huge pages get partially allocated from ZONE_MOVABLE. Try unplugging
1G via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:

[ 180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[ 180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[ 180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[ 180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[ 180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
[ 180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[ 180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[ 180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[ 180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
[ 180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy


I succeed in unplugging 540 MB; 484 MB remain blocked by huge pages
("which did not end up there by pure luck"). These pages are movable
(and even free!) and can easily be reallocated.

--
Thanks,

David / dhildenb

2021-02-17 14:17:33

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Wed 17-02-21 15:08:04, David Hildenbrand wrote:
> On 17.02.21 14:59, Michal Hocko wrote:
> > On Wed 17-02-21 14:53:37, David Hildenbrand wrote:
> > > On 17.02.21 14:50, Michal Hocko wrote:
> > [...]
> > > > Do we have any real-life examples? Or does this fall more into the
> > > > "let's optimize an existing implementation" category?
> > > >
> > >
> > > It's a big TODO item I have on my list and I am happy that Oscar is looking
> > > into it. So yes, I noticed it while working on virtio-mem. It's real.
> >
> > Do not take me wrong, I am not opposing the functionality. I am
> > asking for the specific use case.
>
> Makes sense, and a proper motivation should be included in the patches/cover
> letter. So here comes a quick-n-dirty example:
>
>
> Start a VM with 4G. Hotplug 1G via virtio-mem and online it to ZONE_MOVABLE.
> Allocate 512 huge pages.
>
> [root@localhost ~]# cat /proc/meminfo
> MemTotal: 5061512 kB
> MemFree: 3319396 kB
> MemAvailable: 3457144 kB
> ...
> HugePages_Total: 512
> HugePages_Free: 512
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
>
>
> The huge pages get partially allocated from ZONE_MOVABLE. Try unplugging 1G
> via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:
>
> [ 180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [ 180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [ 180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [ 180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [ 180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [ 180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [ 180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [ 180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [ 180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [ 180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
>
>
> I succeed in unplugging 540 MB; 484 MB remain blocked by huge pages ("which
> did not end up there by pure luck"). These pages are movable (and even
> free!) and can easily be reallocated.

OK, this sounds reasonable.
--
Michal Hocko
SUSE Labs

2021-02-17 14:27:09

by Oscar Salvador

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Wed, Feb 17, 2021 at 03:08:04PM +0100, David Hildenbrand wrote:
> On 17.02.21 14:59, Michal Hocko wrote:
> > On Wed 17-02-21 14:53:37, David Hildenbrand wrote:
> > > On 17.02.21 14:50, Michal Hocko wrote:
> > [...]
> > > > Do we have any real-life examples? Or does this fall more into the
> > > > "let's optimize an existing implementation" category?
> > > >
> > >
> > > It's a big TODO item I have on my list and I am happy that Oscar is looking
> > > into it. So yes, I noticed it while working on virtio-mem. It's real.
> >
> > Do not take me wrong, I am not opposing the functionality. I am
> > asking for the specific use case.
>
> Makes sense, and a proper motivation should be included in the patches/cover
> letter. So here comes a quick-n-dirty example:

Definitely. I took it for granted that the problem was obvious.

> Start a VM with 4G. Hotplug 1G via virtio-mem and online it to ZONE_MOVABLE.
> Allocate 512 huge pages.
>
> [root@localhost ~]# cat /proc/meminfo
> MemTotal: 5061512 kB
> MemFree: 3319396 kB
> MemAvailable: 3457144 kB
> ...
> HugePages_Total: 512
> HugePages_Free: 512
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
>
>
> The huge pages get partially allocated from ZONE_MOVABLE. Try unplugging 1G
> via virtio-mem (remember, all ZONE_MOVABLE). Inside the guest:
>
> [ 180.058992] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [ 180.060531] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [ 180.061972] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [ 180.063413] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [ 180.064838] alloc_contig_range: [1b8000, 1c0000) PFNs busy
> [ 180.065848] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [ 180.066794] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [ 180.067738] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [ 180.068669] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
> [ 180.069598] alloc_contig_range: [1bfc00, 1c0000) PFNs busy
>
>
> I succeed in unplugging 540 MB; 484 MB remain blocked by huge pages ("which
> did not end up there by pure luck"). These pages are movable (and even
> free!) and can easily be reallocated.

Thanks for providing such a detailed case, David.
I will gather the bits and prepare a v2 once I collect more feedback.

--
Oscar Salvador
SUSE L3

2021-02-17 15:05:07

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
[...]
> +static bool alloc_and_dissolve_huge_page(struct hstate *h, struct page *page)
> +{
> + gfp_t gfp_mask = htlb_alloc_mask(h);
> + nodemask_t *nmask = &node_states[N_MEMORY];
> + struct page *new_page;
> + bool ret = false;
> + int nid;
> +
> + spin_lock(&hugetlb_lock);
> + /*
> + * Check one more time to make race-window smaller.
> + */
> + if (!PageHuge(page)) {
> + /*
> + * Dissolved from under our feet.
> + */
> + spin_unlock(&hugetlb_lock);
> + return true;
> + }

Is this really necessary? dissolve_free_huge_page will take care of this
and the race window you are covering is really tiny.

> +
> + nid = page_to_nid(page);
> + spin_unlock(&hugetlb_lock);
> +
> + /*
> + * Before dissolving the page, we need to allocate a new one,
> + * so the pool remains stable.
> + */
> + new_page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);

wrt. the fallback to other zones, I hadn't realized that the primary
use case is a form of memory offlining (from virtio-mem). I am not yet sure
what the proper behavior is in that case, but if breaking hugetlb pools,
similar to the normal hotplug operation, is viable then this needs a
special mode. We do not want a random alloc_contig_range user to do the
same. So for starters I would go with __GFP_THISNODE here.
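I mean something like this (untested, just to illustrate the gfp change;
the rest of the call stays as in the patch):

	new_page = alloc_fresh_huge_page(h, gfp_mask | __GFP_THISNODE, nid,
					 nmask, NULL);

That way the fresh hugepage either comes from the old page's node or the
allocation fails.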

> + if (new_page) {
> + /*
> + * Ok, we got a new free hugepage to replace this one. Try to
> + * dissolve the old page.
> + */
> + if (!dissolve_free_huge_page(page)) {
> + ret = true;
> + } else if (dissolve_free_huge_page(new_page)) {
> + /*
> + * Seems the old page could not be dissolved, so try to
> + * dissolve the freshly allocated page. If that fails
> + * too, let us count the new page as a surplus. Doing so
> + * allows the pool to be re-balanced when pages are freed
> + * instead of enqueued again.
> + */
> + spin_lock(&hugetlb_lock);
> + h->surplus_huge_pages++;
> + h->surplus_huge_pages_node[nid]++;
> + spin_unlock(&hugetlb_lock);
> + }
> + /*
> + * Free it into the hugepage allocator
> + */
> + put_page(new_page);
> + }
> +
> + return ret;
> +}
> +
> +bool isolate_or_dissolve_huge_page(struct page *page)
> +{
> + struct hstate *h = NULL;
> + struct page *head;
> + bool ret = false;
> +
> + spin_lock(&hugetlb_lock);
> + if (PageHuge(page)) {
> + head = compound_head(page);
> + h = page_hstate(head);
> + }
> + spin_unlock(&hugetlb_lock);
> +
> + if (!h)
> + /*
> + * The page might have been dissolved from under our feet.
> + * If that is the case, return success as if we dissolved it
> + * ourselves.
> + */
> + return true;

nit: I would put the comment above the condition for both cases. It reads
more easily that way. At least without { }.

> +
> + if (hstate_is_gigantic(h))
> + /*
> + * Fence off gigantic pages as there is a cyclic dependency
> + * between alloc_contig_range and them.
> + */
> + return ret;
> +
> + if (!page_count(head) && alloc_and_dissolve_huge_page(h, head))
> + ret = true;
> +
> + return ret;
> +}
> +
> struct page *alloc_huge_page(struct vm_area_struct *vma,
> unsigned long addr, int avoid_reserve)
> {

Other than that I haven't noticed any surprises.
--
Michal Hocko
SUSE Labs

2021-02-17 18:04:08

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
> Free hugetlb pages are tricky to handle: so that no userspace application
> notices disruption, we need to replace the current free hugepage with
> a new one.
>
> In order to do that, a new function called alloc_and_dissolve_huge_page
> is introduced.
> This function will first try to get a new fresh hugetlb page, and if it
> succeeds, it will dissolve the old one.
>
> With regard to the allocation, since we do not know whether the old page
> was allocated on a specific node by request, the node the old page belongs
> to will be tried first, and then we will fall back to all nodes containing
> memory (N_MEMORY).

I do not think falling back to a different zone is ok. If yes, then this
really requires very good reasoning. alloc_contig_range is an
optimistic allocation interface at best, and it shouldn't break carefully
node-aware preallocation done by the administrator.

> Note that gigantic hugetlb pages are fenced off since there is a cyclic
> dependency between them and alloc_contig_range.

Why do we need/want to do all this in the first place?
--
Michal Hocko
SUSE Labs

2021-02-17 18:10:53

by David Hildenbrand

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On 17.02.21 14:50, Michal Hocko wrote:
> On Wed 17-02-21 14:36:47, David Hildenbrand wrote:
>> On 17.02.21 14:30, Michal Hocko wrote:
>>> On Wed 17-02-21 11:08:15, Oscar Salvador wrote:
>>>> Free hugetlb pages are tricky to handle: so that no userspace application
>>>> notices disruption, we need to replace the current free hugepage with
>>>> a new one.
>>>>
>>>> In order to do that, a new function called alloc_and_dissolve_huge_page
>>>> is introduced.
>>>> This function will first try to get a new fresh hugetlb page, and if it
>>>> succeeds, it will dissolve the old one.
>>>>
>>>> With regard to the allocation, since we do not know whether the old page
>>>> was allocated on a specific node by request, the node the old page belongs
>>>> to will be tried first, and then we will fall back to all nodes containing
>>>> memory (N_MEMORY).
>>>
>>> I do not think falling back to a different zone is ok. If yes, then this
>>> really requires very good reasoning. alloc_contig_range is an
>>> optimistic allocation interface at best, and it shouldn't break carefully
>>> node-aware preallocation done by the administrator.
>>
>> What does memory offlining do when migrating in-use hugetlbfs pages? Does it
>> always keep the node?
>
> No, it will break the node pool. The reasoning behind that is that
> offlining is an explicit request from userspace and it is expected

Userspace? In 99.9996% of cases it's the hardware that triggers the unplug of a DIMM.

>
>> I think keeping the node is the easiest/simplest approach for now.
>>
>>>
>>>> Note that gigantic hugetlb pages are fenced off since there is a cyclic
>>>> dependency between them and alloc_contig_range.
>>>
>>> Why do we need/want to do all this in the first place?
>>
>> cma and virtio-mem (especially on ZONE_MOVABLE) really want to handle
>> hugetlbfs pages.
>
> Do we have any real-life examples? Or does this fall more into the
> "let's optimize an existing implementation" category?
>

It's a big TODO item I have on my list and I am happy that Oscar is
looking into it. So yes, I noticed it while working on virtio-mem. It's
real.

--
Thanks,

David / dhildenb

2021-02-18 11:31:15

by Oscar Salvador

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Wed, Feb 17, 2021 at 04:00:11PM +0100, Michal Hocko wrote:
> Is this really necessary? dissolve_free_huge_page will take care of this
> and the race window you are covering is really tiny.

Probably not, I was trying to shrink the race window as much as possible
but the call to dissolve_free_huge_page might be enough.

> > + nid = page_to_nid(page);
> > + spin_unlock(&hugetlb_lock);
> > +
> > + /*
> > + * Before dissolving the page, we need to allocate a new one,
> > + * so the pool remains stable.
> > + */
> > + new_page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
>
> wrt. the fallback to other zones, I hadn't realized that the primary
> use case is a form of memory offlining (from virtio-mem). I am not yet sure
> what the proper behavior is in that case, but if breaking hugetlb pools,
> similar to the normal hotplug operation, is viable then this needs a
> special mode. We do not want a random alloc_contig_range user to do the
> same. So for starters I would go with __GFP_THISNODE here.

Ok, makes sense.
__GFP_THISNODE will not allow falling back to other nodes' zones.
Since we only allow the nid the page belongs to, nodemask should be
NULL, right?
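That is, something like this, if I am reading it right:

	new_page = alloc_fresh_huge_page(h, htlb_alloc_mask(h) | __GFP_THISNODE,
					 nid, NULL, NULL);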

> > + if (!h)
> > + /*
> > + * The page might have been dissolved from under our feet.
> > + * If that is the case, return success as if we dissolved it
> > + * ourselves.
> > + */
> > + return true;
>
> nit: I would put the comment above the condition for both cases. It reads
> more easily that way. At least without { }.

Yes, makes sense.

> Other than that I haven't noticed any surprises.

I did. The 'put_page' call should be placed above, right after getting
the page. Otherwise, refcount == 1 and we will fail to dissolve the
new page if we need to (in case old page fails to be dissolved).
I already fixed that locally.

--
Oscar Salvador
SUSE L3

2021-02-18 15:05:46

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Thu 18-02-21 11:09:17, Oscar Salvador wrote:
> On Wed, Feb 17, 2021 at 04:00:11PM +0100, Michal Hocko wrote:
> > Is this really necessary? dissolve_free_huge_page will take care of this
> > and the race window you are covering is really tiny.
>
> Probably not, I was trying to shrink the race window as much as possible
> but the call to dissolve_free_huge_page might be enough.
>
> > > + nid = page_to_nid(page);
> > > + spin_unlock(&hugetlb_lock);
> > > +
> > > + /*
> > > + * Before dissolving the page, we need to allocate a new one,
> > > + * so the pool remains stable.
> > > + */
> > > + new_page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
> >
> > wrt. the fallback to other zones, I hadn't realized that the primary
> > use case is a form of memory offlining (from virtio-mem). I am not yet sure
> > what the proper behavior is in that case, but if breaking hugetlb pools,
> > similar to the normal hotplug operation, is viable then this needs a
> > special mode. We do not want a random alloc_contig_range user to do the
> > same. So for starters I would go with __GFP_THISNODE here.
>
> Ok, makes sense.
> __GFP_THISNODE will not allow falling back to other nodes' zones.
> Since we only allow the nid the page belongs to, nodemask should be
> NULL, right?

I would have to double check, because hugetlb has slightly different
expectations from the nodemask than the page allocator. The latter translates
that to all possible nodes, but the hugetlb API tries to dereference nodes.
Maybe THIS node special-cases it somewhere.

> > > + if (!h)
> > > + /*
> > > + * The page might have been dissolved from under our feet.
> > > + * If that is the case, return success as if we dissolved it
> > > + * ourselves.
> > > + */
> > > + return true;
> >
> > nit: I would put the comment above the condition for both cases. It reads
> > more easily that way. At least without { }.
>
> Yes, makes sense.
>
> > Other than that I haven't noticed any surprises.
>
> I did. The 'put_page' call should be placed above, right after getting
> the page. Otherwise, refcount == 1 and we will fail to dissolve the
> new page if we need to (in case old page fails to be dissolved).
> I already fixed that locally.

I am not sure I follow. The newly allocated page is unreferenced
unconditionally and the old page is not referenced by this path.

--
Michal Hocko
SUSE Labs

2021-02-18 16:17:39

by Oscar Salvador

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Thu, Feb 18, 2021 at 01:52:38PM +0100, Michal Hocko wrote:
> > Ok, makes sense.
> > __GFP_THISNODE will not allow falling back to other nodes' zones.
> > Since we only allow the nid the page belongs to, nodemask should be
> > NULL, right?
>
> I would have to double check, because hugetlb has slightly different
> expectations from the nodemask than the page allocator. The latter translates
> that to all possible nodes, but the hugetlb API tries to dereference nodes.
> Maybe THIS node special-cases it somewhere.

Uhm, I do not quite follow here.
AFAICS, alloc_fresh_huge_page->alloc_buddy_huge_page does nothing
with the nodemask, but rather with the nodes_retry mask. That is done
so as not to retry on a node where we failed to allocate a page.

Now, alloc_buddy_huge_page calls __alloc_pages_nodemask directly.
If my understanding is correct, it is ok to have a null nodemask
as __next_zones_zonelist() will go through our own zonelist,
since __GFP_THISNODE made us take ZONELIST_NOFALLBACK.
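(For reference, the call I mean boils down to something like

	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);

and, as far as I can see, the page allocator copes with nmask == NULL by
simply using all allowed nodes.)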

Actually, I do not see how passing a non-null nodemask might have
helped there, unless we allow specifying more nodes.

> > I did. The 'put_page' call should be placed above, right after getting
> > the page. Otherwise, refcount == 1 and we will fail to dissolve the
> > new page if we need to (in case old page fails to be dissolved).
> > I already fixed that locally.
>
> I am not sure I follow. The newly allocated page is unreferenced
> unconditionally and the old page is not referenced by this path.

Current code is:

allocate_a_new_page (new_page's refcount = 1)
dissolve_old_page
: if fail
dissolve_new_page (we cannot dissolve it refcount != 0)
put_page(new_page);

It should be:

allocate_a_new_page (new_page's refcount = 1)
put_page(new_page); (new_page's refcount = 0)
dissolve_old_page
: if fail
dissolve_new_page (we can dissolve it as refcount == 0)

I hope this clarifies it .

--
Oscar Salvador
SUSE L3

2021-02-18 16:46:23

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Thu 18-02-21 14:32:50, Oscar Salvador wrote:
> On Thu, Feb 18, 2021 at 01:52:38PM +0100, Michal Hocko wrote:
> > > Ok, makes sense.
> > > __GFP_THISNODE will not allow falling back to other nodes' zones.
> > > Since we only allow the nid the page belongs to, nodemask should be
> > > NULL, right?
> >
> > I would have to double check, because hugetlb has slightly different
> > expectations from the nodemask than the page allocator. The latter translates
> > that to all possible nodes, but the hugetlb API tries to dereference nodes.
> > Maybe THIS node special-cases it somewhere.
>
> Uhm, I do not quite follow here.
> AFAICS, alloc_fresh_huge_page->alloc_buddy_huge_page does nothing
> with the nodemask, but rather with the nodes_retry mask. That is done
> so as not to retry on a node where we failed to allocate a page.
>
> Now, alloc_buddy_huge_page calls __alloc_pages_nodemask directly.
> If my understanding is correct, it is ok to have a null nodemask
> as __next_zones_zonelist() will go through our own zonelist,
> since __GFP_THISNODE made us take ZONELIST_NOFALLBACK.

As I've said, the page allocator can cope with a NULL nodemask just fine.
I have checked the code and now remember the tricky part. It is
alloc_gigantic_page which cannot work with a NULL nodemask, because it
relies on for_each_node_mask and that, unlike the zonelist iterator, cannot
cope with a NULL node mask. This is the case only for !GFP_THISNODE.
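Roughly this shape, abridged from alloc_gigantic_page() (from memory, so
take the details with a grain of salt):

	if (!(gfp_mask & __GFP_THISNODE)) {
		/* *nodemask is dereferenced, so NULL would crash here */
		for_each_node_mask(node, *nodemask) {
			...
		}
	}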

> Actually, I do not see how passing a non-null nodemask might have
> helped there, unless we allow specifying more nodes.

No, the nodemask doesn't make any difference.

> > > I did. The 'put_page' call should be placed above, right after getting
> > > the page. Otherwise, refcount == 1 and we will fail to dissolve the
> > > new page if we need to (in case old page fails to be dissolved).
> > > I already fixed that locally.
> >
> > I am not sure I follow. The newly allocated page is unreferenced
> > unconditionally and the old page is not referenced by this path.
>
> Current code is:
>
> allocate_a_new_page (new_page's refcount = 1)
> dissolve_old_page
> : if fail
> dissolve_new_page (we cannot dissolve it refcount != 0)
> put_page(new_page);

OK, new_page would go to the pool rather than get freed as this is
neither a surplus nor a temporary page.

> It should be:
>
> allocate_a_new_page (new_page's refcount = 1)
> put_page(new_page); (new_page's refcount = 0)
> dissolve_old_page
> : if fail
> dissolve_new_page (we can dissolve it as refcount == 0)
>
> I hope this clarifies it .

OK, I see the problem now. And your above solution is not really
optimal either. Your put_page would add the page to the pool and so it
could be used by somebody. One way around it would be either directly
manipulating the reference count, which is fugly, or you can make it a
temporary page (alloc_migrate_huge_page), or maybe even better not
special-case this here but rather allow migrating free hugetlb pages in the
migrate_page path.
--
Michal Hocko
SUSE Labs

2021-02-18 18:54:33

by Oscar Salvador

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On 2021-02-18 14:59, Michal Hocko wrote:
> As I've said, the page allocator can cope with a NULL nodemask just fine.
> I have checked the code and now remember the tricky part. It is
> alloc_gigantic_page which cannot work with a NULL nodemask, because it
> relies on for_each_node_mask and that, unlike the zonelist iterator, cannot
> cope with a NULL node mask. This is the case only for !GFP_THISNODE.

Ok, thanks for the clarification.

> OK, I see the problem now. And your above solution is not really
> optimal either. Your put_page would add the page to the pool and so it
> could be used by somebody. One

Yeah, that is right.

> way around it would be either directly
> manipulating the reference count, which is fugly, or you can make it a
> temporary page (alloc_migrate_huge_page), or maybe even better not
> special-case this here but rather allow migrating free hugetlb pages in the
> migrate_page path.

I will have a look into it and see what that would look like.

Thanks for the feedback

--
Oscar Salvador
SUSE L3

2021-02-19 09:14:48

by Oscar Salvador

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Thu, Feb 18, 2021 at 02:59:40PM +0100, Michal Hocko wrote:
> > It should be:
> >
> > allocate_a_new_page (new_page's refcount = 1)
> > put_page(new_page); (new_page's refcount = 0)
> > dissolve_old_page
> > : if fail
> > dissolve_new_page (we can dissolve it as refcount == 0)
> >
> > I hope this clarifies it .
>
> OK, I see the problem now. And your above solution is not really
> optimal either. Your put_page would add the page to the pool and so it
> could be used by somebody. One way around it would be either directly
> manipulating the reference count, which is fugly, or you can make it a
> temporary page (alloc_migrate_huge_page), or maybe even better not
> special-case this here but rather allow migrating free hugetlb pages in the
> migrate_page path.

I have been weighing up this option because it seemed the cleanest way to
proceed. Having the ability to migrate free hugepages directly from migrate_page
would spare us this function.
But there is a problem. migrate_pages needs the pages to be on a list (by
page->lru). That is no problem for used pages, but for free hugepages we would
have to remove the page from hstate->hugepage_freelists, meaning that if userspace
comes along and tries to get a hugepage (a hugepage it previously allocated via
e.g. /proc/sys/.../nr_hugepages), it will fail.

So I am not really sure we can go this way. Unless we are willing to accept
that userspace can temporarily get ENOMEM if it tries to use a hugepage, which
I do not think is a good idea.
Another way to go would be to compensate for the free hugepages being migrated
by allocating the same amount, but that starts to go down a rabbit hole.

I still need to give it some more spins, but all in all, I think the easiest
way forward might be to do something like:

alloc_and_dissolve_huge_page {

	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, NULL, NULL);
	if (new_page) {
		/*
		 * Put the page in the freelist hugepage pool.
		 * We might race with someone coming by and grabbing the page,
		 * but that is fine since we mark the page as Temporary in case
		 * both old and new_page fail to be dissolved, so new_page
		 * will be freed when its last reference is gone.
		 */
		put_page(new_page);

		if (!dissolve_free_huge_page(page)) {
			/*
			 * Old page could be dissolved.
			 */
			ret = true;
		} else if (dissolve_free_huge_page(new_page)) {
			/*
			 * Page might have been dissolved by admin by doing
			 * "echo 0 > /proc/../nr_hugepages". Check it before
			 * marking the page.
			 */
			spin_lock(&hugetlb_lock);
			/*
			 * Mark the page Temporary in case we fail to dissolve
			 * both the old page and new_page. It will be freed
			 * when the last user drops it.
			 */
			if (PageHuge(new_page))
				SetPageHugeTemporary(new_page);
			spin_unlock(&hugetlb_lock);
		}
	}
}

There is one more thing to cover though.
Between the dissolve_free_huge_page(new_page) call and the PageHuge check, the
page might have been freed, and so enqueued (because we did not mark it as
Temporary yet). If that is the case, we would later mark the page as Temporary
while it sits in the free pool, which is not nice.

One way out would be to pass a boolean parameter to dissolve_free_huge_page()
that tells whether the page should be marked Temporary in case the operation
fails. That would ease things a lot, because __everything__ would be done under
the lock, which means the page cannot go away in the meantime.
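Something like this, signature-wise (hypothetical, just to illustrate):

	/*
	 * If mark_temporary is true and the dissolve fails, mark the page
	 * HugeTemporary while still holding hugetlb_lock, so there is no
	 * window where it sits unmarked in the free pool.
	 */
	int dissolve_free_huge_page(struct page *page, bool mark_temporary);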

Just my thoughts.

What do you think?


--
Oscar Salvador
SUSE L3

2021-02-19 09:58:57

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Fri 19-02-21 10:05:53, Oscar Salvador wrote:
> On Thu, Feb 18, 2021 at 02:59:40PM +0100, Michal Hocko wrote:
> > > It should be:
> > >
> > > allocate_a_new_page (new_page's refcount = 1)
> > > put_page(new_page); (new_page's refcount = 0)
> > > dissolve_old_page
> > > : if fail
> > > dissolve_new_page (we can dissolve it as refcount == 0)
> > >
> > > I hope this clarifies it .
> >
> > OK, I see the problem now. And your above solution is not really
> > optimal either. Your put_page would add the page to the pool and so it
> > could be used by somebody. One way around it would be either directly
> > manipulating reference count which is fugly or you can make it a
> > temporal page (alloc_migrate_huge_page) or maybe even better not special
> > case this here but rather allow migrating free hugetlb pages in the
> > migrate_page path.
>
> I have been weighing up this option because it seemed the cleanest way to
> proceed. Having the ability to migrate free hugepages directly from migrate_page
> would spare us this function.
> But there is a problem. migrate_pages needs the pages to be on a list (by
> page->lru). That is no problem for used pages, but for free hugepages we would
> have to remove the page from hstate->hugepage_freelists, meaning that if userspace
> comes along and tries to get a hugepage (a hugepage it previously allocated via
> e.g. /proc/sys/.../nr_hugepages), it will fail.

Good point. I should have realized that.

> So I am not really sure we can go this way. Unless we are willing to accept
> that userspace can temporarily get ENOMEM if it tries to use a hugepage, which
> I do not think is a good idea.

No, this is not acceptable.

> Another way to go would be to compensate for the free hugepages being migrated
> by allocating the same amount, but that starts to go down a rabbit hole.
>
> I still need to give it some more spins, but all in all, I think the easiest
> way forward might be to do something like:
>
> alloc_and_dissolve_huge_page {
>
> new_page = alloc_fresh_huge_page(h, gfp_mask, nid, NULL, NULL);
> if (new_page) {
> /*
> * Put the page in the freelist hugepage pool.
> * We might race with someone coming by and grabbing the page,
> * but that is fine since we mark the page as Temporary in case
> * both old and new_page fail to be dissolved, so new_page
> * will be freed when its last reference is gone.
> */
> put_page(new_page);
>
> if (!dissolve_free_huge_page(page)) {
> /*
> * Old page could be dissolved.
> */
> ret = true;
> } else if (dissolve_free_huge_page(new_page)) {
> /*
> * Page might have been dissolved by admin by doing
> * "echo 0 > /proc/../nr_hugepages". Check it before marking
> * the page.
> */
> spin_lock(&hugetlb_lock);
> /* Mark the page Temporary in case we fail to dissolve both
> * the old page and new_page. It will be freed when the last
> * user drops it.
> */
> if (PageHuge(new_page))
> SetPageHugeTemporary(new_page);
> spin_unlock(&hugetlb_lock);
> }
> }

OK, this should work but I am really wondering whether it wouldn't be
just simpler to replace the old page by a new one in the free list
directly. Or is there any reason we have to go through the generic
helpers path? I mean something like this

	new_page = alloc_fresh_huge_page();
	if (!new_page)
		goto fail;
	spin_lock(hugetlb_lock);
	if (!PageHuge(old_page)) {
		/* freed from under us, nothing to do */
		__update_and_free_page(new_page);
		goto unlock;
	}
	list_del(&old_page->lru);
	__update_and_free_page(old_page);
	__enqueue_huge_page(new_page);
unlock:
	spin_unlock(hugetlb_lock);

This will require splitting update_and_free_page and enqueue_huge_page into
counter-independent parts, but that shouldn't be a big deal. It will also
protect from any races. Not an act of beauty, but it seems less hackish
to me.
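The enqueue side of the split could look something like this (hypothetical
double-underscore helper, assuming the current shape of enqueue_huge_page()):

	static void __enqueue_huge_page(struct hstate *h, struct page *page)
	{
		list_move(&page->lru, &h->hugepage_freelists[page_to_nid(page)]);
	}

	static void enqueue_huge_page(struct hstate *h, struct page *page)
	{
		__enqueue_huge_page(h, page);
		h->free_huge_pages++;
		h->free_huge_pages_node[page_to_nid(page)]++;
	}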
--
Michal Hocko
SUSE Labs

2021-02-19 10:19:19

by Oscar Salvador

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Fri, Feb 19, 2021 at 10:56:42AM +0100, Michal Hocko wrote:
> OK, this should work but I am really wondering whether it wouldn't be
> just simpler to replace the old page by a new one in the free list
> directly. Or is there any reason we have to go through the generic
> helpers path? I mean something like this
>
> new_page = alloc_fresh_huge_page();
> if (!new_page)
> goto fail;
> spin_lock(hugetlb_lock);
> if (!PageHuge(old_page)) {
> /* freed from under us, nothing to do */
> __update_and_free_page(new_page);
> goto unlock;
> }
> list_del(&old_page->lru);
> __update_and_free_page(old_page);
> __enqueue_huge_page(new_page);
> unlock:
> spin_unlock(hugetlb_lock);
>
> This will require splitting update_and_free_page and enqueue_huge_page into
> counter-independent parts, but that shouldn't be a big deal. It will also
> protect from any races. Not an act of beauty, but it seems less hackish
> to me.

Yes, I think this would do the trick, and it is race-free.
Let me play with it a bit and see what I can come up with.

Thanks for the valuable insight.

--
Oscar Salvador
SUSE L3

2021-02-19 10:44:49

by Oscar Salvador

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Fri, Feb 19, 2021 at 10:56:42AM +0100, Michal Hocko wrote:
> OK, this should work but I am really wondering whether it wouldn't be
> just simpler to replace the old page by a new one in the free list
> directly. Or is there any reason we have to go through the generic
> helpers path? I mean something like this
>
> new_page = alloc_fresh_huge_page();
> if (!new_page)
> goto fail;
> spin_lock(hugetlb_lock);
> if (!PageHuge(old_page)) {
> /* freed from under us, nothing to do */
> __update_and_free_page(new_page);
> goto unlock;
> }
> list_del(&old_page->lru);
> __update_and_free_page(old_page);
> __enqueue_huge_page(new_page);
> unlock:
> spin_unlock(hugetlb_lock);
>
> This will require splitting update_and_free_page and enqueue_huge_page into
> counter-independent parts, but that shouldn't be a big deal. It will also
> protect from any races. Not an act of beauty, but it seems less hackish
> to me.

On a closer look, do we really need to decouple update_and_free_page and
enqueue_huge_page? These two functions do not handle the lock; rather, the
functions that call them do (as would be the case here).
Only update_and_free_page drops the lock during the freeing of a gigantic page
and then takes it again, as it is the caller who took the lock.

Am I missing anything obvious here?

--
Oscar Salvador
SUSE L3

2021-02-19 11:02:30

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Fri 19-02-21 11:40:30, Oscar Salvador wrote:
> On Fri, Feb 19, 2021 at 10:56:42AM +0100, Michal Hocko wrote:
> > OK, this should work but I am really wondering whether it wouldn't be
> > just simpler to replace the old page by a new one in the free list
> > directly. Or is there any reason we have to go through the generic
> > helpers path? I mean something like this
> >
> > new_page = alloc_fresh_huge_page();
> > if (!new_page)
> > goto fail;
> > spin_lock(hugetlb_lock);
> > if (!PageHuge(old_page)) {
> > /* freed from under us, nothing to do */
> > __update_and_free_page(new_page);
> > goto unlock;
> > }
> > list_del(&old_page->lru);
> > __update_and_free_page(old_page);
> > __enqueue_huge_page(new_page);
> > unlock:
> > spin_unlock(hugetlb_lock);
> >
> > This will require splitting update_and_free_page and enqueue_huge_page into
> > counter-independent parts, but that shouldn't be a big deal. It will also
> > protect from any races. Not an act of beauty, but it seems less hackish
> > to me.
>
> On a closer look, do we really need to decouple update_and_free_page and
> enqueue_huge_page? These two functions do not handle the lock; rather, the
> functions that call them do (as would be the case here).
> Only update_and_free_page drops the lock during the freeing of a gigantic page
> and then takes it again, as it is the caller who took the lock.
>
> Am I missing anything obvious here?

It is not the lock that I care about but more the counters. The
intention was that there is a single place to handle both enqueueing and
dequeueing, as not all places require counters to be updated, e.g. the
migration case, which just replaces one page by another.
--
Michal Hocko
SUSE Labs

2021-02-19 11:18:49

by Oscar Salvador

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Fri, Feb 19, 2021 at 11:55:00AM +0100, Michal Hocko wrote:
> It is not the lock that I care about but more the counters. The
> intention was that there is a single place to handle both enqueueing and
> dequeueing, as not all places require counters to be updated, e.g. the
> migration case, which just replaces one page by another.

I see.
alloc_fresh_huge_page->prep_new_huge_page increments h->nr_huge_pages{_node}
counters.
Which means:

> new_page = alloc_fresh_huge_page();
> if (!new_page)
> goto fail;
> spin_lock(hugetlb_lock);
> if (!PageHuge(old_page)) {
> /* freed from under us, nothing to do */
> __update_and_free_page(new_page);

Here we need update_and_free_page, otherwise we would be leaving a stale value
in h->nr_huge_pages{_node}.

> goto unlock;
> }
> list_del(&old_page->lru);
> __update_and_free_page(old_page);

Same here.

> __enqueue_huge_page(new_page);

This is ok since h->free_huge_pages{_node} do not need to be updated.
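So, putting it together, something like this (with a hypothetical
__enqueue_huge_page helper; the new page's reference from the allocation
still needs to be dropped safely, as discussed before):

	new_page = alloc_fresh_huge_page(h, gfp_mask, nid, NULL, NULL);
	if (!new_page)
		return false;

	spin_lock(&hugetlb_lock);
	if (!PageHuge(old_page)) {
		/* Freed from under us, nothing to do */
		update_and_free_page(h, new_page);
	} else {
		list_del(&old_page->lru);
		update_and_free_page(h, old_page);
		__enqueue_huge_page(h, new_page);
	}
	spin_unlock(&hugetlb_lock);
	return true;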

--
Oscar Salvador
SUSE L3

2021-02-19 11:31:18

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On Fri 19-02-21 12:17:11, Oscar Salvador wrote:
> On Fri, Feb 19, 2021 at 11:55:00AM +0100, Michal Hocko wrote:
> > It is not the lock that I care about but more the counters. The
> > intention was that there is a single place to handle both enqueueing and
> > dequeueing, as not all places require counters to be updated, e.g. the
> > migration case, which just replaces one page by another.
>
> I see.
> alloc_fresh_huge_page->prep_new_huge_page increments h->nr_huge_pages{_node}
> counters.
> Which means:
>
> > new_page = alloc_fresh_huge_page();
> > if (!new_page)
> > goto fail;
> > spin_lock(hugetlb_lock);
> > if (!PageHuge(old_page)) {
> > /* freed from under us, nothing to do */
> > __update_and_free_page(new_page);
>
> Here we need update_and_free_page, otherwise we would be leaving a stale value
> in h->nr_huge_pages{_node}.
>
> > goto unlock;
> > }
> > list_del(&old_page->lru);
> > __update_and_free_page(old_page);
>
> Same here.
>
> > __enqueue_huge_page(new_page);
>
> This is ok since h->free_huge_pages{_node} do not need to be updated.

Fair enough. I didn't get to think this through obviously, but you
should get the idea ;)
--
Michal Hocko
SUSE Labs

2021-02-19 20:04:24

by Mike Kravetz

Subject: Re: [PATCH 1/2] mm: Make alloc_contig_range handle free hugetlb pages

On 2/19/21 2:14 AM, Oscar Salvador wrote:
> On Fri, Feb 19, 2021 at 10:56:42AM +0100, Michal Hocko wrote:
>> OK, this should work but I am really wondering whether it wouldn't be
>> just simpler to replace the old page by a new one in the free list
>> directly. Or is there any reason we have to go through the generic
>> helpers path? I mean something like this
>>
>> new_page = alloc_fresh_huge_page();
>> if (!new_page)
>> goto fail;
>> spin_lock(hugetlb_lock);
>> if (!PageHuge(old_page)) {

Yes, something like this should work. I'll let Oscar work out the details.
One thing to note is that you also need to check for old_page not being on
the free list here. It could have been allocated and be in use. In addition,
make sure to check the new flag HPageFreed to ensure the page is on the free
list before doing any type of update_and_free_page call.
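I.e., something along these lines inside the locked section (hypothetical
placement; the flag is brand new, so double-check the spelling):

	if (!PageHuge(old_page)) {
		/* freed from under us, nothing to do */
	} else if (page_count(old_page) || !HPageFreed(old_page)) {
		/* allocated and in use, or not (yet) on the free list */
	}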
--
Mike Kravetz

>> /* freed from under us, nothing to do */
>> __update_and_free_page(new_page);
>> goto unlock;
>> }
>> list_del(&old_page->lru);
>> __update_and_free_page(old_page);
>> __enqueue_huge_page(new_page);
>> unlock:
>> spin_unlock(hugetlb_lock);
>>
>> This will require splitting update_and_free_page and enqueue_huge_page into
>> counter-independent parts, but that shouldn't be a big deal. It will also
>> protect from any races. Not an act of beauty, but it seems less hackish
>> to me.
>
> Yes, I think this would do the trick, and it is race-free.
> Let me play with it a bit and see what I can come up with.
>
> Thanks for the valuable insight.
>