2022-10-12 08:52:36

by 黄杰

[permalink] [raw]
Subject: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

From: "huangjie.albert" <[email protected]>

Implement these two functions so that the mempolicy can be set on
the inode of the hugetlb file. This ensures that the mempolicy of
all processes sharing this huge page file is consistent.

In some scenarios where huge pages are shared:
if we need to limit the VM's memory usage to node0, we set QEMU's
mempolicy to bind to node0. But if another process (such as virtiofsd)
shares memory with the VM and the page fault is triggered by
virtiofsd, the allocated memory may go to node1, depending on
virtiofsd's own mempolicy.

Signed-off-by: huangjie.albert <[email protected]>
---
mm/hugetlb.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0ad53ad98e74..ed7599821655 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4678,6 +4678,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
+int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+
+	return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
+}
+
+struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	pgoff_t index;
+
+	index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
+}
+#endif
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -4691,6 +4709,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_NUMA
+	.set_policy = hugetlb_vm_op_set_policy,
+	.get_policy = hugetlb_vm_op_get_policy,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
--
2.31.1
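
For illustration only (not part of the patch), a minimal user-space sketch of what
these hooks enable: with set_policy wired to the hugetlbfs inode, an mbind() issued
by one process on a shared hugetlb mapping is recorded on the inode, so later faults
from any other process mapping the same file should obey it. The file path, size and
node numbers are hypothetical, and the program assumes libnuma (-lnuma) for mbind():

/* Sketch: process A binds a shared hugetlbfs mapping to node 0.
 * With the patch applied, the policy is stored on the inode, so a later
 * fault from process B (e.g. virtiofsd) on the same file should also
 * allocate from node 0.  Assumes 2 MB huge pages and a hypothetical
 * file under a hugetlbfs mount at /dev/hugepages.
 */
#include <fcntl.h>
#include <numaif.h>		/* mbind(), MPOL_BIND */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64UL << 21;	/* 64 huge pages of 2 MB = 128 MB */
	int fd = open("/dev/hugepages/guest-mem", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, len))
		return 1;

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	unsigned long nodemask = 1UL << 0;	/* node 0 only */
	/* Without the patch this policy lives only in process A's VMA;
	 * with it, it is attached to the shared hugetlbfs inode. */
	if (mbind(p, len, MPOL_BIND, &nodemask, 8, 0))
		return 1;

	memset(p, 0, len);	/* fault the pages in under the policy */
	return 0;
}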


2022-10-12 20:16:30

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

On Wed, 12 Oct 2022, Albert Huang wrote:

> From: "huangjie.albert" <[email protected]>
>
> implement these two functions so that we can set the mempolicy to
> the inode of the hugetlb file. This ensures that the mempolicy of
> all processes sharing this huge page file is consistent.
>
> In some scenarios where huge pages are shared:
> if we need to limit the memory usage of vm within node0, so I set qemu's
> mempilciy bind to node0, but if there is a process (such as virtiofsd)
> shared memory with the vm, in this case. If the page fault is triggered
> by virtiofsd, the allocated memory may go to node1 which depends on
> virtiofsd.
>
> Signed-off-by: huangjie.albert <[email protected]>

Aha! Congratulations for noticing, after all this time. hugetlbfs
contains various little pieces of code that pretend to be supporting
shared NUMA mempolicy, but in fact there was nothing connecting it up.

It will be for Mike to decide, but personally I oppose adding
shared NUMA mempolicy support to hugetlbfs, after eighteen years.

The thing is, it will change the behaviour of NUMA on hugetlbfs:
in ways that would have been sensible way back then, yes; but surely
those who have invested in NUMA and hugetlbfs have developed other
ways of administering it successfully, without shared NUMA mempolicy.

At the least, I would expect some tests to break (I could easily be
wrong), and there's a chance that some app or tool would break too.

I have carried the reverse of Albert's patch for a long time, stripping
out the pretence of shared NUMA mempolicy support from hugetlbfs: I
wanted that, so that I could work on modifying the tmpfs implementation,
without having to worry about other users.

Mike, if you would prefer to see my patch stripping out the pretence,
let us know: it has never been a priority to send in, but I can update
it to 6.1-rc1 if you'd like to see it. (Once upon a time, it removed
all need for struct hugetlbfs_inode_info, but nowadays that's still
required for the memfd seals.)

Whether Albert's patch is complete and correct, I haven't begun to think
about: I am not saying it isn't, but shared NUMA mempolicy adds another
dimension of complexity, and need for support, that I think hugetlbfs
would be better off continuing to survive without.

Hugh

> ---
> mm/hugetlb.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0ad53ad98e74..ed7599821655 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4678,6 +4678,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
> return 0;
> }
>
> +#ifdef CONFIG_NUMA
> +int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
> +{
> + struct inode *inode = file_inode(vma->vm_file);
> +
> + return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
> +}
> +
> +struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
> +{
> + struct inode *inode = file_inode(vma->vm_file);
> + pgoff_t index;
> +
> + index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> + return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
> +}
> +#endif
> +
> /*
> * When a new function is introduced to vm_operations_struct and added
> * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
> @@ -4691,6 +4709,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
> .close = hugetlb_vm_op_close,
> .may_split = hugetlb_vm_op_split,
> .pagesize = hugetlb_vm_op_pagesize,
> +#ifdef CONFIG_NUMA
> + .set_policy = hugetlb_vm_op_set_policy,
> + .get_policy = hugetlb_vm_op_get_policy,
> +#endif
> };
>
> static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> --
> 2.31.1

2022-10-14 17:11:50

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

On 10/12/22 12:45, Hugh Dickins wrote:
> On Wed, 12 Oct 2022, Albert Huang wrote:
>
> > From: "huangjie.albert" <[email protected]>
> >
> > implement these two functions so that we can set the mempolicy to
> > the inode of the hugetlb file. This ensures that the mempolicy of
> > all processes sharing this huge page file is consistent.
> >
> > In some scenarios where huge pages are shared:
> > if we need to limit the memory usage of vm within node0, so I set qemu's
> > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > shared memory with the vm, in this case. If the page fault is triggered
> > by virtiofsd, the allocated memory may go to node1 which depends on
> > virtiofsd.
> >
> > Signed-off-by: huangjie.albert <[email protected]>

Thanks for the patch Albert, and thank you Hugh for the comments!

> Aha! Congratulations for noticing, after all this time. hugetlbfs
> contains various little pieces of code that pretend to be supporting
> shared NUMA mempolicy, but in fact there was nothing connecting it up.

I actually had to look this up to verify it was not supported. However, the
documentation is fairly clear.
From admin-guide/mm/numa_memory_policy.rst.

"As of 2.6.22, only shared memory segments, created by shmget() or
mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
policy support was added to Linux, the associated data structures were
added to hugetlbfs shmem segments. At the time, hugetlbfs did not
support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
shmem segments were never "hooked up" to the shared policy support.
Although hugetlbfs segments now support lazy allocation, their support
for shared policy has not been completed."

It is somewhat embarrassing that this has been known for so long and
nothing has changed.

> It will be for Mike to decide, but personally I oppose adding
> shared NUMA mempolicy support to hugetlbfs, after eighteen years.
>
> The thing is, it will change the behaviour of NUMA on hugetlbfs:
> in ways that would have been sensible way back then, yes; but surely
> those who have invested in NUMA and hugetlbfs have developed other
> ways of administering it successfully, without shared NUMA mempolicy.
>
> At the least, I would expect some tests to break (I could easily be
> wrong), and there's a chance that some app or tool would break too.
>
> I have carried the reverse of Albert's patch for a long time, stripping
> out the pretence of shared NUMA mempolicy support from hugetlbfs: I
> wanted that, so that I could work on modifying the tmpfs implementation,
> without having to worry about other users.
>
> Mike, if you would prefer to see my patch stripping out the pretence,
> let us know: it has never been a priority to send in, but I can update
> it to 6.1-rc1 if you'd like to see it. (Once upon a time, it removed
> all need for struct hugetlbfs_inode_info, but nowadays that's still
> required for the memfd seals.)
>
> Whether Albert's patch is complete and correct, I haven't begun to think
> about: I am not saying it isn't, but shared NUMA mempolicy adds another
> dimension of complexity, and need for support, that I think hugetlbfs
> would be better off continuing to survive without.

To be honest, I have not looked into the complexities of shared NUMA
mempolicy and exactly what is required for its support. With my limited
knowledge, it appears that this patch adds some type of support for shared
policy, but it may not provide all support mentioned in the documentation.

At the very least, this patch should also update documentation to state
what type of support is provided.

Albert, can you look into what would be required for full support? I can take
a look as well but have some other higher-priority tasks to work on first.

TBH, I like Hugh's idea of removing the 'pretence of shared policy support'.
We are currently wasting memory carrying around extra unused fields in
hugetlbfs_inode_info. :(
--
Mike Kravetz
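
As a reminder of the behaviour that documentation describes for the case that is
already hooked up (plain shmget(), no SHM_HUGETLB), a minimal sketch; segment size
and node number are illustrative. The policy set by one attacher is stored on the
shared object and honoured when any other attacher faults the pages:

/* Sketch of the shared-policy case supported today: a SysV shm
 * segment without SHM_HUGETLB.  The mbind() below is remembered on
 * the shared object, so another process attaching the same segment
 * and faulting its pages also gets them from node 0.
 */
#include <numaif.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	size_t len = 16UL << 20;	/* 16 MB, illustrative */
	int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);

	if (id < 0)
		return 1;

	char *p = shmat(id, NULL, 0);
	if (p == (void *)-1)
		return 1;

	unsigned long nodemask = 1UL << 0;	/* node 0 */
	if (mbind(p, len, MPOL_BIND, &nodemask, 8, 0))
		return 1;

	memset(p, 0, len);	/* pages are allocated on node 0 */
	return 0;
}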

2022-10-17 03:40:11

by 黄杰

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

Mike Kravetz <[email protected]> wrote on Sat, Oct 15, 2022 at 00:56:
>
> On 10/12/22 12:45, Hugh Dickins wrote:
> > On Wed, 12 Oct 2022, Albert Huang wrote:
> >
> > > From: "huangjie.albert" <[email protected]>
> > >
> > > implement these two functions so that we can set the mempolicy to
> > > the inode of the hugetlb file. This ensures that the mempolicy of
> > > all processes sharing this huge page file is consistent.
> > >
> > > In some scenarios where huge pages are shared:
> > > if we need to limit the memory usage of vm within node0, so I set qemu's
> > > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > > shared memory with the vm, in this case. If the page fault is triggered
> > > by virtiofsd, the allocated memory may go to node1 which depends on
> > > virtiofsd.
> > >
> > > Signed-off-by: huangjie.albert <[email protected]>
>
> Thanks for the patch Albert, and thank you Hugh for the comments!
>
> > Aha! Congratulations for noticing, after all this time. hugetlbfs
> > contains various little pieces of code that pretend to be supporting
> > shared NUMA mempolicy, but in fact there was nothing connecting it up.
>
> I actually had to look this up to verify it was not supported. However, the
> documentation is fairly clear.
> From admin-guide/mm/numa_memory_policy.rst.
>
> "As of 2.6.22, only shared memory segments, created by shmget() or
> mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
> policy support was added to Linux, the associated data structures were
> added to hugetlbfs shmem segments. At the time, hugetlbfs did not
> support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
> shmem segments were never "hooked up" to the shared policy support.
> Although hugetlbfs segments now support lazy allocation, their support
> for shared policy has not been completed."
>
> It is somewhat embarrassing that this has been known for so long and
> nothing has changed.
>
> > It will be for Mike to decide, but personally I oppose adding
> > shared NUMA mempolicy support to hugetlbfs, after eighteen years.
> >
> > The thing is, it will change the behaviour of NUMA on hugetlbfs:
> > in ways that would have been sensible way back then, yes; but surely
> > those who have invested in NUMA and hugetlbfs have developed other
> > ways of administering it successfully, without shared NUMA mempolicy.
> >
> > At the least, I would expect some tests to break (I could easily be
> > wrong), and there's a chance that some app or tool would break too.
> >
> > I have carried the reverse of Albert's patch for a long time, stripping
> > out the pretence of shared NUMA mempolicy support from hugetlbfs: I
> > wanted that, so that I could work on modifying the tmpfs implementation,
> > without having to worry about other users.
> >
> > Mike, if you would prefer to see my patch stripping out the pretence,
> > let us know: it has never been a priority to send in, but I can update
> > it to 6.1-rc1 if you'd like to see it. (Once upon a time, it removed
> > all need for struct hugetlbfs_inode_info, but nowadays that's still
> > required for the memfd seals.)
> >
> > Whether Albert's patch is complete and correct, I haven't begun to think
> > about: I am not saying it isn't, but shared NUMA mempolicy adds another
> > dimension of complexity, and need for support, that I think hugetlbfs
> > would be better off continuing to survive without.
>
> To be honest, I have not looked into the complexities of shared NUMA
> mempolicy and exactly what is required for it's support. With my limited
> knowledge, it appears that this patch adds some type of support for shared
> policy, but it may not provide all support mentioned in the documentation.
>
> At the very least, this patch should also update documentation to state
> what type of support is provided.
>
> Albert, can you look into what would be required for full support? I can take
> a look as well but have some other higher priority tasks to work first.
>

Happy to take this on; let me think about it.

> TBH, I like Hugh's idea of removing the 'pretence of shared policy support'.
> We are currently wasting memory carrying around extra unused fields in
> hugetlbfs_inode_info. :(
> --
> Mike Kravetz

2022-10-17 09:12:52

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

On 12.10.22 10:15, Albert Huang wrote:
> From: "huangjie.albert" <[email protected]>
>
> implement these two functions so that we can set the mempolicy to
> the inode of the hugetlb file. This ensures that the mempolicy of
> all processes sharing this huge page file is consistent.
>
> In some scenarios where huge pages are shared:
> if we need to limit the memory usage of vm within node0, so I set qemu's
> mempilciy bind to node0, but if there is a process (such as virtiofsd)
> shared memory with the vm, in this case. If the page fault is triggered
> by virtiofsd, the allocated memory may go to node1 which depends on
> virtiofsd.
>

Any VM that uses hugetlb should be preallocating memory. For example,
this is the expected default under QEMU when using huge pages.

Once preallocation does the right thing regarding NUMA policy, there is
no need to worry about it in other sub-processes.

--
Thanks,

David / dhildenb
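
For illustration of that point (a sketch under assumptions, not QEMU's actual code):
if the VMM preallocates the hugetlb-backed region while its own mempolicy is in
effect, placement is decided before any helper process can fault a page, so the
helpers' policies stop mattering. MADV_POPULATE_WRITE assumes Linux 5.14+ and
matching headers; the node number and helper name are hypothetical:

/* Sketch: preallocate hugetlb-backed guest RAM under the caller's
 * bind-to-node-0 policy.  After this returns, no later first touch by
 * virtiofsd or another sharer decides where the pages land.
 */
#include <numaif.h>
#include <sys/mman.h>

static void *prealloc_guest_ram(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	unsigned long nodemask = 1UL << 0;	/* node 0 */

	if (p == MAP_FAILED)
		return NULL;

	if (mbind(p, len, MPOL_BIND, &nodemask, 8, 0))
		return NULL;

	/* Fault every huge page in now, under the policy above
	 * (Linux 5.14+; older kernels can touch each page instead). */
	if (madvise(p, len, MADV_POPULATE_WRITE))
		return NULL;

	return p;
}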

2022-10-17 10:04:25

by 黄杰

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

David Hildenbrand <[email protected]> wrote on Mon, Oct 17, 2022 at 16:44:
>
> On 12.10.22 10:15, Albert Huang wrote:
> > From: "huangjie.albert" <[email protected]>
> >
> > implement these two functions so that we can set the mempolicy to
> > the inode of the hugetlb file. This ensures that the mempolicy of
> > all processes sharing this huge page file is consistent.
> >
> > In some scenarios where huge pages are shared:
> > if we need to limit the memory usage of vm within node0, so I set qemu's
> > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > shared memory with the vm, in this case. If the page fault is triggered
> > by virtiofsd, the allocated memory may go to node1 which depends on
> > virtiofsd.
> >
>
> Any VM that uses hugetlb should be preallocating memory. For example,
> this is the expected default under QEMU when using huge pages.
>
> Once preallocation does the right thing regarding NUMA policy, there is
> no need to worry about it in other sub-processes.
>

Hi David,
thanks for the reminder.

Yes, you are absolutely right: the pre-allocation mechanism does solve
this problem. However, some scenarios prefer not to use pre-allocation,
such as those that are sensitive to virtual machine startup time or
that require high memory utilization. The on-demand allocation mechanism
may be better there, so the key point is to find a way to support the
shared policy.

Thanks,

Albert

> --
> Thanks,
>
> David / dhildenb
>

2022-10-17 12:03:14

by David Hildenbrand

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

On 17.10.22 11:48, 黄杰 wrote:
> David Hildenbrand <[email protected]> wrote on Mon, Oct 17, 2022 at 16:44:
>>
>> On 12.10.22 10:15, Albert Huang wrote:
>>> From: "huangjie.albert" <[email protected]>
>>>
>>> implement these two functions so that we can set the mempolicy to
>>> the inode of the hugetlb file. This ensures that the mempolicy of
>>> all processes sharing this huge page file is consistent.
>>>
>>> In some scenarios where huge pages are shared:
>>> if we need to limit the memory usage of vm within node0, so I set qemu's
>>> mempilciy bind to node0, but if there is a process (such as virtiofsd)
>>> shared memory with the vm, in this case. If the page fault is triggered
>>> by virtiofsd, the allocated memory may go to node1 which depends on
>>> virtiofsd.
>>>
>>
>> Any VM that uses hugetlb should be preallocating memory. For example,
>> this is the expected default under QEMU when using huge pages.
>>
>> Once preallocation does the right thing regarding NUMA policy, there is
>> no need to worry about it in other sub-processes.
>>
>
> Hi, David
> thanks for your reminder
>
> Yes, you are absolutely right, However, the pre-allocation mechanism
> does solve this problem.
> However, some scenarios do not like to use the pre-allocation mechanism, such as
> scenarios that are sensitive to virtual machine startup time, or
> scenarios that require
> high memory utilization. The on-demand allocation mechanism may be better,
> so the key point is to find a way support for shared policy。

Using hugetlb -- with a fixed pool size -- without preallocation is like
playing with fire. Hugetlb reservation makes one believe that on-demand
allocation is going to work, but there are various scenarios where that
can go seriously wrong, and you can run out of huge pages.

If you're using hugetlb as memory backend for a VM without
preallocation, you really have to be very careful. I can only advise
against doing that.


Also: why does another process read/write *first* to a guest physical
memory location before the OS running inside the VM even initialized
that memory? That sounds very wrong. What am I missing?

--
Thanks,

David / dhildenb

2022-10-17 18:03:06

by Mike Kravetz

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

On 10/17/22 13:33, David Hildenbrand wrote:
> On 17.10.22 11:48, 黄杰 wrote:
> > David Hildenbrand <[email protected]> wrote on Mon, Oct 17, 2022 at 16:44:
> > >
> > > On 12.10.22 10:15, Albert Huang wrote:
> > > > From: "huangjie.albert" <[email protected]>
> > > >
> > > > implement these two functions so that we can set the mempolicy to
> > > > the inode of the hugetlb file. This ensures that the mempolicy of
> > > > all processes sharing this huge page file is consistent.
> > > >
> > > > In some scenarios where huge pages are shared:
> > > > if we need to limit the memory usage of vm within node0, so I set qemu's
> > > > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > > > shared memory with the vm, in this case. If the page fault is triggered
> > > > by virtiofsd, the allocated memory may go to node1 which depends on
> > > > virtiofsd.
> > > >
> > >
> > > Any VM that uses hugetlb should be preallocating memory. For example,
> > > this is the expected default under QEMU when using huge pages.
> > >
> > > Once preallocation does the right thing regarding NUMA policy, there is
> > > no need to worry about it in other sub-processes.
> > >
> >
> > Hi, David
> > thanks for your reminder
> >
> > Yes, you are absolutely right, However, the pre-allocation mechanism
> > does solve this problem.
> > However, some scenarios do not like to use the pre-allocation mechanism, such as
> > scenarios that are sensitive to virtual machine startup time, or
> > scenarios that require
> > high memory utilization. The on-demand allocation mechanism may be better,
> > so the key point is to find a way support for shared policy。
>
> Using hugetlb -- with a fixed pool size -- without preallocation is like
> playing with fire. Hugetlb reservation makes one believe that on-demand
> allocation is going to work, but there are various scenarios where that can
> go seriously wrong, and you can run out of huge pages.

I absolutely agree with this cautionary note.

hugetlb reservations guarantee that a sufficient number of huge pages exist.
However, there is no guarantee that those pages are on any specific node
associated with a NUMA policy. Therefore, an 'on demand' allocation could
fail, resulting in SIGBUS being sent to the faulting process.
--
Mike Kravetz

2022-10-18 09:44:06

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

On 18.10.22 11:27, 黄杰 wrote:
> David Hildenbrand <[email protected]> wrote on Mon, Oct 17, 2022 at 20:00:
>>
>> On 17.10.22 13:46, 黄杰 wrote:
>>> David Hildenbrand <[email protected]> wrote on Mon, Oct 17, 2022 at 19:33:
>>>>
>>>> On 17.10.22 11:48, 黄杰 wrote:
>>>>> David Hildenbrand <[email protected]> wrote on Mon, Oct 17, 2022 at 16:44:
>>>>>>
>>>>>> On 12.10.22 10:15, Albert Huang wrote:
>>>>>>> From: "huangjie.albert" <[email protected]>
>>>>>>>
>>>>>>> implement these two functions so that we can set the mempolicy to
>>>>>>> the inode of the hugetlb file. This ensures that the mempolicy of
>>>>>>> all processes sharing this huge page file is consistent.
>>>>>>>
>>>>>>> In some scenarios where huge pages are shared:
>>>>>>> if we need to limit the memory usage of vm within node0, so I set qemu's
>>>>>>> mempilciy bind to node0, but if there is a process (such as virtiofsd)
>>>>>>> shared memory with the vm, in this case. If the page fault is triggered
>>>>>>> by virtiofsd, the allocated memory may go to node1 which depends on
>>>>>>> virtiofsd.
>>>>>>>
>>>>>>
>>>>>> Any VM that uses hugetlb should be preallocating memory. For example,
>>>>>> this is the expected default under QEMU when using huge pages.
>>>>>>
>>>>>> Once preallocation does the right thing regarding NUMA policy, there is
>>>>>> no need to worry about it in other sub-processes.
>>>>>>
>>>>>
>>>>> Hi, David
>>>>> thanks for your reminder
>>>>>
>>>>> Yes, you are absolutely right, However, the pre-allocation mechanism
>>>>> does solve this problem.
>>>>> However, some scenarios do not like to use the pre-allocation mechanism, such as
>>>>> scenarios that are sensitive to virtual machine startup time, or
>>>>> scenarios that require
>>>>> high memory utilization. The on-demand allocation mechanism may be better,
>>>>> so the key point is to find a way support for shared policy。
>>>>
>>>> Using hugetlb -- with a fixed pool size -- without preallocation is like
>>>> playing with fire. Hugetlb reservation makes one believe that on-demand
>>>> allocation is going to work, but there are various scenarios where that
>>>> can go seriously wrong, and you can run out of huge pages.
>>>>
>>>> If you're using hugetlb as memory backend for a VM without
>>>> preallocation, you really have to be very careful. I can only advise
>>>> against doing that.
>>>>
>>>>
>>>> Also: why does another process read/write *first* to a guest physical
>>>> memory location before the OS running inside the VM even initialized
>>>> that memory? That sounds very wrong. What am I missing?
>>>>
>>>
>>> For example: the virtio ring buffer.
>>> For the avail descriptor, the guest kernel only gives an address to
>>> the backend,
>>> and does not actually access the memory.
>>
>> Okay, thanks. So we're essentially providing uninitialized memory to a
>> device? Hm, that implies that the device might have access to memory
>> that was previously used by someone else ... not sure how to feel about
>> that, but maybe this is just the way of doing things.
>>
>> The "easy" user-space fix would be to simply similarly mbind() in the
>> other processes where we mmap(). Has that option been explored?
>
> This can also solve the problem temporarily, but we need to change all
> processes that share memory with it, so it can't be done once and for
> all

How many processes that really care are we talking about? Usually,
extending the vhost-user protocol and relevant libraries should be good
enough.

--
Thanks,

David / dhildenb
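
David's suggested user-space workaround, sketched with a hypothetical helper (in
practice the fd and memory layout arrive over the vhost-user protocol): each sharer
applies the same mbind() to its own mapping, because without kernel support the
policy lives only in that process's VMA, not on the inode:

/* Sketch of the per-process workaround: every sharer (e.g. virtiofsd)
 * binds its own mapping of the guest-memory fd it received.
 * Hypothetical helper, not part of any existing library.
 */
#include <numaif.h>
#include <sys/mman.h>

static void *map_guest_mem_bound(int memfd, size_t len, int node)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		       memfd, 0);
	unsigned long nodemask = 1UL << node;

	if (p == MAP_FAILED)
		return NULL;

	/* Repeat the VMM's policy here; it only covers this mapping. */
	if (mbind(p, len, MPOL_BIND, &nodemask, 8, 0)) {
		munmap(p, len);
		return NULL;
	}
	return p;
}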

2022-10-18 09:53:43

by 黄杰

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

Mike Kravetz <[email protected]> wrote on Tue, Oct 18, 2022 at 01:59:
>
> On 10/17/22 13:33, David Hildenbrand wrote:
> > On 17.10.22 11:48, 黄杰 wrote:
> > > David Hildenbrand <[email protected]> wrote on Mon, Oct 17, 2022 at 16:44:
> > > >
> > > > On 12.10.22 10:15, Albert Huang wrote:
> > > > > From: "huangjie.albert" <[email protected]>
> > > > >
> > > > > implement these two functions so that we can set the mempolicy to
> > > > > the inode of the hugetlb file. This ensures that the mempolicy of
> > > > > all processes sharing this huge page file is consistent.
> > > > >
> > > > > In some scenarios where huge pages are shared:
> > > > > if we need to limit the memory usage of vm within node0, so I set qemu's
> > > > > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > > > > shared memory with the vm, in this case. If the page fault is triggered
> > > > > by virtiofsd, the allocated memory may go to node1 which depends on
> > > > > virtiofsd.
> > > > >
> > > >
> > > > Any VM that uses hugetlb should be preallocating memory. For example,
> > > > this is the expected default under QEMU when using huge pages.
> > > >
> > > > Once preallocation does the right thing regarding NUMA policy, there is
> > > > no need to worry about it in other sub-processes.
> > > >
> > >
> > > Hi, David
> > > thanks for your reminder
> > >
> > > Yes, you are absolutely right, However, the pre-allocation mechanism
> > > does solve this problem.
> > > However, some scenarios do not like to use the pre-allocation mechanism, such as
> > > scenarios that are sensitive to virtual machine startup time, or
> > > scenarios that require
> > > high memory utilization. The on-demand allocation mechanism may be better,
> > > so the key point is to find a way support for shared policy。
> >
> > Using hugetlb -- with a fixed pool size -- without preallocation is like
> > playing with fire. Hugetlb reservation makes one believe that on-demand
> > allocation is going to work, but there are various scenarios where that can
> > go seriously wrong, and you can run out of huge pages.
>
> I absolutely agree with this cautionary note.
>
> hugetlb reservations guarantee that a sufficient number of huge pages exist.
> However, there is no guarantee that those pages are on any specific node
> associated with a numa policy. Therefore, an 'on demand' allocation could
> fail resulting in SIGBUS being set to the faulting process.
> -

Yes, supporting on-demand allocation requires adding a fair amount of
other code. I have thought about this, but I currently have no code
that is suitable for submitting to the community.


> Mike Kravetz

2022-10-19 10:33:39

by 黄杰

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

Hugh Dickins <[email protected]> wrote on Thu, Oct 13, 2022 at 03:45:
>
> On Wed, 12 Oct 2022, Albert Huang wrote:
>
> > From: "huangjie.albert" <[email protected]>
> >
> > implement these two functions so that we can set the mempolicy to
> > the inode of the hugetlb file. This ensures that the mempolicy of
> > all processes sharing this huge page file is consistent.
> >
> > In some scenarios where huge pages are shared:
> > if we need to limit the memory usage of vm within node0, so I set qemu's
> > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > shared memory with the vm, in this case. If the page fault is triggered
> > by virtiofsd, the allocated memory may go to node1 which depends on
> > virtiofsd.
> >
> > Signed-off-by: huangjie.albert <[email protected]>
>
> Aha! Congratulations for noticing, after all this time. hugetlbfs
> contains various little pieces of code that pretend to be supporting
> shared NUMA mempolicy, but in fact there was nothing connecting it up.
>
> It will be for Mike to decide, but personally I oppose adding
> shared NUMA mempolicy support to hugetlbfs, after eighteen years.
>
> The thing is, it will change the behaviour of NUMA on hugetlbfs:
> in ways that would have been sensible way back then, yes; but surely
> those who have invested in NUMA and hugetlbfs have developed other
> ways of administering it successfully, without shared NUMA mempolicy.
>
> At the least, I would expect some tests to break (I could easily be
> wrong), and there's a chance that some app or tool would break too.

Hi Hugh,

Can you share some of the issues you have in mind here?

Thanks.

>
> I have carried the reverse of Albert's patch for a long time, stripping
> out the pretence of shared NUMA mempolicy support from hugetlbfs: I
> wanted that, so that I could work on modifying the tmpfs implementation,
> without having to worry about other users.
>
> Mike, if you would prefer to see my patch stripping out the pretence,
> let us know: it has never been a priority to send in, but I can update
> it to 6.1-rc1 if you'd like to see it. (Once upon a time, it removed
> all need for struct hugetlbfs_inode_info, but nowadays that's still
> required for the memfd seals.)
>
> Whether Albert's patch is complete and correct, I haven't begun to think
> about: I am not saying it isn't, but shared NUMA mempolicy adds another
> dimension of complexity, and need for support, that I think hugetlbfs
> would be better off continuing to survive without.
>
> Hugh
>
> > ---
> > mm/hugetlb.c | 22 ++++++++++++++++++++++
> > 1 file changed, 22 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 0ad53ad98e74..ed7599821655 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4678,6 +4678,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
> > return 0;
> > }
> >
> > +#ifdef CONFIG_NUMA
> > +int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
> > +{
> > + struct inode *inode = file_inode(vma->vm_file);
> > +
> > + return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
> > +}
> > +
> > +struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
> > +{
> > + struct inode *inode = file_inode(vma->vm_file);
> > + pgoff_t index;
> > +
> > + index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > + return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
> > +}
> > +#endif
> > +
> > /*
> > * When a new function is introduced to vm_operations_struct and added
> > * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
> > @@ -4691,6 +4709,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
> > .close = hugetlb_vm_op_close,
> > .may_split = hugetlb_vm_op_split,
> > .pagesize = hugetlb_vm_op_pagesize,
> > +#ifdef CONFIG_NUMA
> > + .set_policy = hugetlb_vm_op_set_policy,
> > + .get_policy = hugetlb_vm_op_get_policy,
> > +#endif
> > };
> >
> > static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> > --
> > 2.31.1

2022-10-19 11:13:01

by 黄杰

[permalink] [raw]
Subject: [PATCH v2] mm: hugetlb: support for shared memory policy

From: "huangjie.albert" <[email protected]>

Implement get/set_policy for hugetlb_vm_ops to support shared policy.
This ensures that the mempolicy of all processes sharing this huge page
file is consistent.

In some scenarios where huge pages are shared:
if we need to limit the VM's memory usage to node0, we set QEMU's
mempolicy to bind to node0. But if another process (such as virtiofsd)
shares memory with the VM and the page fault is triggered by virtiofsd,
the allocated memory may go to node1, depending on virtiofsd's own
mempolicy. Although we can use the memory preallocation provided by QEMU
to avoid this issue, that method significantly increases the creation
time of the VM (a few seconds, depending on memory size).

After hooking up hugetlb_vm_ops (set/get_policy):
shared memory segments created by shmget() with the SHM_HUGETLB flag,
as well as mmap(MAP_SHARED|MAP_HUGETLB) mappings, also support shared
policy.

v1->v2:
1. hugetlb only shares the memory policy when the vma has the VM_SHARED flag.
2. update the documentation.

Signed-off-by: huangjie.albert <[email protected]>
---
.../admin-guide/mm/numa_memory_policy.rst | 20 +++++++++------
mm/hugetlb.c | 25 +++++++++++++++++++
2 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 5a6afecbb0d0..5672a6c2d2ef 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -133,14 +133,18 @@ Shared Policy
the object share the policy, and all pages allocated for the
shared object, by any task, will obey the shared policy.

- As of 2.6.22, only shared memory segments, created by shmget() or
- mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
- policy support was added to Linux, the associated data structures were
- added to hugetlbfs shmem segments. At the time, hugetlbfs did not
- support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
- shmem segments were never "hooked up" to the shared policy support.
- Although hugetlbfs segments now support lazy allocation, their support
- for shared policy has not been completed.
+ As of 2.6.22, only shared memory segments, created by shmget() without
+ SHM_HUGETLB flag or mmap(MAP_ANONYMOUS|MAP_SHARED) without MAP_HUGETLB
+ flag, support shared policy. When shared policy support was added to Linux,
+ the associated data structures were added to hugetlbfs shmem segments.
+ At the time, hugetlbfs did not support allocation at fault time--a.k.a
+ lazy allocation--so hugetlbfs shmem segments were never "hooked up" to
+ the shared policy support. Although hugetlbfs segments now support lazy
+ allocation, their support for shared policy has not been completed.
+
+ After hooking up hugetlb_vm_ops (set/get_policy), shared memory segments
+ created by shmget() with the SHM_HUGETLB flag, as well as
+ mmap(MAP_SHARED|MAP_HUGETLB) mappings, also support shared policy.

As mentioned above in :ref:`VMA policies <vma_policy>` section,
allocations of page cache pages for regular files mmap()ed
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87d875e5e0a9..fc7038931832 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4632,6 +4632,27 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
+int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return 0;
+
+	return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
+}
+
+struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	pgoff_t index;
+
+	index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
+}
+#endif
+
 /*
  * When a new function is introduced to vm_operations_struct and added
  * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
@@ -4645,6 +4666,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
 	.close = hugetlb_vm_op_close,
 	.may_split = hugetlb_vm_op_split,
 	.pagesize = hugetlb_vm_op_pagesize,
+#ifdef CONFIG_NUMA
+	.set_policy = hugetlb_vm_op_set_policy,
+	.get_policy = hugetlb_vm_op_get_policy,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
--
2.37.0 (Apple Git-136)

2022-10-19 12:25:35

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH v2] mm: hugetlb: support for shared memory policy

On 10/19/22 2:59 PM, Albert Huang wrote:
> From: "huangjie.albert" <[email protected]>
>
> implement get/set_policy for hugetlb_vm_ops to support the shared policy
> This ensures that the mempolicy of all processes sharing this huge page
> file is consistent.
>
> In some scenarios where huge pages are shared:
> if we need to limit the memory usage of vm within node0, so I set qemu's
> mempilciy bind to node0, but if there is a process (such as virtiofsd)
> shared memory with the vm, in this case. If the page fault is triggered
> by virtiofsd, the allocated memory may go to node1 which depends on
> virtiofsd. Although we can use the memory prealloc provided by qemu to
> avoid this issue, but this method will significantly increase the
> creation time of the vm(a few seconds, depending on memory size).
>
> after we hooked up hugetlb_vm_ops(set/get_policy):
> both the shared memory segments created by shmget() with SHM_HUGETLB flag
> and the mmap(MAP_SHARED|MAP_HUGETLB), also support shared policy.
>
> v1->v2:
> 1、hugetlb share the memory policy when the vma with the VM_SHARED flag.
> 2、update the documentation.
>
> Signed-off-by: huangjie.albert <[email protected]>
> ---
> .../admin-guide/mm/numa_memory_policy.rst | 20 +++++++++------
> mm/hugetlb.c | 25 +++++++++++++++++++
> 2 files changed, 37 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index 5a6afecbb0d0..5672a6c2d2ef 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -133,14 +133,18 @@ Shared Policy
> the object share the policy, and all pages allocated for the
> shared object, by any task, will obey the shared policy.
>
> - As of 2.6.22, only shared memory segments, created by shmget() or
> - mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
> - policy support was added to Linux, the associated data structures were
> - added to hugetlbfs shmem segments. At the time, hugetlbfs did not
> - support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
> - shmem segments were never "hooked up" to the shared policy support.
> - Although hugetlbfs segments now support lazy allocation, their support
> - for shared policy has not been completed.
> + As of 2.6.22, only shared memory segments, created by shmget() without
> + SHM_HUGETLB flag or mmap(MAP_ANONYMOUS|MAP_SHARED) without MAP_HUGETLB
> + flag, support shared policy. When shared policy support was added to Linux,
> + the associated data structures were added to hugetlbfs shmem segments.
> + At the time, hugetlbfs did not support allocation at fault time--a.k.a
> + lazy allocation--so hugetlbfs shmem segments were never "hooked up" to
> + the shared policy support. Although hugetlbfs segments now support lazy
> + allocation, their support for shared policy has not been completed.
> +
> + after we hooked up hugetlb_vm_ops(set/get_policy):
> + both the shared memory segments created by shmget() with SHM_HUGETLB flag
> + and mmap(MAP_SHARED|MAP_HUGETLB), also support shared policy.
>
> As mentioned above in :ref:`VMA policies <vma_policy>` section,
> allocations of page cache pages for regular files mmap()ed
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 87d875e5e0a9..fc7038931832 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4632,6 +4632,27 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
> return 0;
> }
>
> +#ifdef CONFIG_NUMA
> +int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
> +{
> + struct inode *inode = file_inode(vma->vm_file);
> +
> + if (!(vma->vm_flags & VM_SHARED))
> + return 0;
> +
> + return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
> +}
> +
> +struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
> +{
> + struct inode *inode = file_inode(vma->vm_file);
> + pgoff_t index;
> +
> + index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> + return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
> +}
> +#endif
> +
> /*
> * When a new function is introduced to vm_operations_struct and added
> * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
> @@ -4645,6 +4666,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
> .close = hugetlb_vm_op_close,
> .may_split = hugetlb_vm_op_split,
> .pagesize = hugetlb_vm_op_pagesize,
> +#ifdef CONFIG_NUMA
> + .set_policy = hugetlb_vm_op_set_policy,
> + .get_policy = hugetlb_vm_op_get_policy,
> +#endif
> };
>
> static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,


How is the current usage of

/* Set numa allocation policy based on index */
hugetlb_set_vma_policy(&pseudo_vma, inode, index);

enforcing the policy with the current code? Also, if we have get_policy(),
can we remove that usage in hugetlbfs_fallocate() after this patch?
With shared policy we should be able to fetch the policy via
get_vma_policy()?

A related question: does shm_pseudo_vma_init() require that mempolicy lookup?

-aneesh


2022-10-23 20:38:02

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

On Wed, 19 Oct 2022, 黄杰 wrote:
> Hugh Dickins <[email protected]> wrote on Thu, Oct 13, 2022 at 03:45:
> > On Wed, 12 Oct 2022, Albert Huang wrote:
> > > From: "huangjie.albert" <[email protected]>
> > >
> > > implement these two functions so that we can set the mempolicy to
> > > the inode of the hugetlb file. This ensures that the mempolicy of
> > > all processes sharing this huge page file is consistent.
> > >
> > > In some scenarios where huge pages are shared:
> > > if we need to limit the memory usage of vm within node0, so I set qemu's
> > > mempilciy bind to node0, but if there is a process (such as virtiofsd)
> > > shared memory with the vm, in this case. If the page fault is triggered
> > > by virtiofsd, the allocated memory may go to node1 which depends on
> > > virtiofsd.
> > >
> > > Signed-off-by: huangjie.albert <[email protected]>
> >
> > Aha! Congratulations for noticing, after all this time. hugetlbfs
> > contains various little pieces of code that pretend to be supporting
> > shared NUMA mempolicy, but in fact there was nothing connecting it up.
> >
> > It will be for Mike to decide, but personally I oppose adding
> > shared NUMA mempolicy support to hugetlbfs, after eighteen years.
> >
> > The thing is, it will change the behaviour of NUMA on hugetlbfs:
> > in ways that would have been sensible way back then, yes; but surely
> > those who have invested in NUMA and hugetlbfs have developed other
> > ways of administering it successfully, without shared NUMA mempolicy.
> >
> > At the least, I would expect some tests to break (I could easily be
> > wrong), and there's a chance that some app or tool would break too.
>
> Hi : Hugh
>
> Can you share some issues here?

Sorry, I don't think I can: precisely because it's been such a relief
to know that hugetlbfs is not in the shared NUMA mempolicy game, I've
given no thought to what issues it might have if it joined the game.

Not much memory is wasted on the unused fields in hugetlbfs_inode_info,
just a few bytes per inode, that aspect doesn't concern me much.

Reference counting of shared mempolicy has certainly been a recurrent
problem in the past (see mpol_needs_cond_ref() etc): stable nowadays
I believe; but whether supporting hugetlbfs would cause new problems
to surface there, I don't know; but whatever, those would just be
bugs to be fixed.

/proc/pid/numa_maps does not represent shared NUMA mempolicies
correctly: not for tmpfs, and would not for hugetlbfs. I did have
old patches to fix that, but not patches that I'm ever likely to
have time to resurrect and present and push.

My main difficulties in tmpfs were with how to deal correctly and
consistently with non-hugepage-aligned mempolicies when hugepages
are in use. In the case of hugetlbfs, it would be simpler, since
you're always dealing in hugepages of a known size: I recommend
being as strict as possible, demanding correctly aligned mempolicy
or else EINVAL. (That may already be enforced, I've not looked.)

But my main concern in extending shared NUMA mempolicy to hugetlbfs
is exactly what I already said earlier:

The thing is, it will change the behaviour of NUMA on hugetlbfs:
in ways that would have been sensible way back then, yes; but surely
those who have invested in NUMA and hugetlbfs have developed other
ways of administering it successfully, without shared NUMA mempolicy.

At the least, I would expect some tests to break (I could easily be
wrong), and there's a chance that some app or tool would break too.

It's a risk, and a body of complication, that I would keep away from
myself. The shared mempolicy rbtree makes sense, but no madvise()
since then has implemented such a tree to attach its advice to ranges
of the shared object rather than to the vma.

Hugh
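
If hugetlbfs did join the game, the strictness Hugh recommends could look roughly
like the sketch below in the set_policy hook (illustrative only, not a proposed
patch; it builds on the helpers used in Albert's patch): reject any range that is
not aligned to the file's huge page size before recording the shared policy.

/* Sketch of "be as strict as possible": refuse to record a shared
 * policy whose range is not huge-page aligned.
 */
static int hugetlb_vm_op_set_policy(struct vm_area_struct *vma,
				    struct mempolicy *mpol)
{
	struct inode *inode = file_inode(vma->vm_file);
	struct hstate *h = hstate_vma(vma);

	if (!(vma->vm_flags & VM_SHARED))
		return 0;

	/* Demand a correctly aligned mempolicy range, or else EINVAL. */
	if (!IS_ALIGNED(vma->vm_start, huge_page_size(h)) ||
	    !IS_ALIGNED(vma->vm_end, huge_page_size(h)))
		return -EINVAL;

	return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
}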