2023-09-25 11:26:26

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 01/12] hugetlbfs: drop shared NUMA mempolicy pretence

hugetlbfs_fallocate() goes through the motions of pasting a shared NUMA
mempolicy onto its pseudo-vma, but how could there ever be a shared NUMA
mempolicy for this file? hugetlb_vm_ops has never offered a set_policy
method, and hugetlbfs_parse_param() has never supported any mpol options
for a mount-wide default policy.

It's just an illusion: clean it away so as not to confuse others, giving
us more freedom to adjust shmem's set_policy/get_policy implementation.
But hugetlbfs_inode_info is still required, just to accommodate seals.

Yes, shared NUMA mempolicy support could be added to hugetlbfs, with a
set_policy method and/or mpol mount option (Andi's first posting did
include an admitted-unsatisfactory hugetlb_set_policy()); but it seems
that nobody has bothered to add that in the nineteen years since v2.6.7
made it possible, and there is at least one company that has invested
enough into hugetlbfs, that I guess they have learnt well enough how to
manage its NUMA, without needing shared mempolicy.

Signed-off-by: Hugh Dickins <[email protected]>
---
fs/hugetlbfs/inode.c | 41 +----------------------------------------
include/linux/hugetlb.h | 2 --
2 files changed, 1 insertion(+), 42 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 316c4cebd3f3..ffee27b10d42 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -83,29 +83,6 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
{}
};

-#ifdef CONFIG_NUMA
-static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
- struct inode *inode, pgoff_t index)
-{
- vma->vm_policy = mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy,
- index);
-}
-
-static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
-{
- mpol_cond_put(vma->vm_policy);
-}
-#else
-static inline void hugetlb_set_vma_policy(struct vm_area_struct *vma,
- struct inode *inode, pgoff_t index)
-{
-}
-
-static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
-{
-}
-#endif
-
/*
* Mask used when checking the page offset value passed in via system
* calls. This value will be converted to a loff_t which is signed.
@@ -852,8 +829,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,

/*
* Initialize a pseudo vma as this is required by the huge page
- * allocation routines. If NUMA is configured, use page index
- * as input to create an allocation policy.
+ * allocation routines.
*/
vma_init(&pseudo_vma, mm);
vm_flags_init(&pseudo_vma, VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
@@ -901,9 +877,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
* folios in these areas, we need to consume the reserves
* to keep reservation accounting consistent.
*/
- hugetlb_set_vma_policy(&pseudo_vma, inode, index);
folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0);
- hugetlb_drop_vma_policy(&pseudo_vma);
if (IS_ERR(folio)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
error = PTR_ERR(folio);
@@ -1282,18 +1256,6 @@ static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
hugetlbfs_inc_free_inodes(sbinfo);
return NULL;
}
-
- /*
- * Any time after allocation, hugetlbfs_destroy_inode can be called
- * for the inode. mpol_free_shared_policy is unconditionally called
- * as part of hugetlbfs_destroy_inode. So, initialize policy here
- * in case of a quick call to destroy.
- *
- * Note that the policy is initialized even if we are creating a
- * private inode. This simplifies hugetlbfs_destroy_inode.
- */
- mpol_shared_policy_init(&p->policy, NULL);
-
return &p->vfs_inode;
}

@@ -1305,7 +1267,6 @@ static void hugetlbfs_free_inode(struct inode *inode)
static void hugetlbfs_destroy_inode(struct inode *inode)
{
hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
- mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
}

static const struct address_space_operations hugetlbfs_aops = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5b2626063f4f..6522eb3cd007 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -30,7 +30,6 @@ void free_huge_folio(struct folio *folio);

#ifdef CONFIG_HUGETLB_PAGE

-#include <linux/mempolicy.h>
#include <linux/shm.h>
#include <asm/tlbflush.h>

@@ -512,7 +511,6 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
}

struct hugetlbfs_inode_info {
- struct shared_policy policy;
struct inode vfs_inode;
unsigned int seals;
};
--
2.35.3


2023-09-25 22:51:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 01/12] hugetlbfs: drop shared NUMA mempolicy pretence

On Mon, Sep 25, 2023 at 01:21:10AM -0700, Hugh Dickins wrote:
> hugetlbfs_fallocate() goes through the motions of pasting a shared NUMA
> mempolicy onto its pseudo-vma, but how could there ever be a shared NUMA
> mempolicy for this file? hugetlb_vm_ops has never offered a set_policy
> method, and hugetlbfs_parse_param() has never supported any mpol options
> for a mount-wide default policy.
>
> It's just an illusion: clean it away so as not to confuse others, giving
> us more freedom to adjust shmem's set_policy/get_policy implementation.
> But hugetlbfs_inode_info is still required, just to accommodate seals.
>
> Yes, shared NUMA mempolicy support could be added to hugetlbfs, with a
> set_policy method and/or mpol mount option (Andi's first posting did
> include an admitted-unsatisfactory hugetlb_set_policy()); but it seems
> that nobody has bothered to add that in the nineteen years since v2.6.7
> made it possible, and there is at least one company that has invested
> enough into hugetlbfs, that I guess they have learnt well enough how to
> manage its NUMA, without needing shared mempolicy.

TBH i'm not sure people in general rely on shared mempolicy. The
original use case for it was to modify the numa policy of non anonymous
shared memory files without modifying the program (e.g. Oracle
database's shared memory segments)

But I don't think that particular usage model ever got any real
traction: at leas I haven't seen any real usage of it outside my tests.

I suspect people either are fine with just process policy or modify the
program, in which case it's not a big burden to modify every user,
so process policy or vma based mbind policy works fine.

Maybe it would be an interesting experiment to disable it everywhere
with some flag and see if anybody complains.

On the other hand it might be Hyrum'ed by now.

-Andi

2023-09-26 05:19:36

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 01/12] hugetlbfs: drop shared NUMA mempolicy pretence

On Mon, Sep 25, 2023 at 01:21:10AM -0700, Hugh Dickins wrote:
> hugetlbfs_fallocate() goes through the motions of pasting a shared NUMA
> mempolicy onto its pseudo-vma, but how could there ever be a shared NUMA
> mempolicy for this file? hugetlb_vm_ops has never offered a set_policy
> method, and hugetlbfs_parse_param() has never supported any mpol options
> for a mount-wide default policy.

Hah. I was wondering, but never cared enough to investigate. Thanks
for doing this.

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

2023-09-27 00:32:03

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 01/12] hugetlbfs: drop shared NUMA mempolicy pretence

On Mon, 25 Sep 2023, Andi Kleen wrote:
> On Mon, Sep 25, 2023 at 01:21:10AM -0700, Hugh Dickins wrote:
> > hugetlbfs_fallocate() goes through the motions of pasting a shared NUMA
> > mempolicy onto its pseudo-vma, but how could there ever be a shared NUMA
> > mempolicy for this file? hugetlb_vm_ops has never offered a set_policy
> > method, and hugetlbfs_parse_param() has never supported any mpol options
> > for a mount-wide default policy.
> >
> > It's just an illusion: clean it away so as not to confuse others, giving
> > us more freedom to adjust shmem's set_policy/get_policy implementation.
> > But hugetlbfs_inode_info is still required, just to accommodate seals.
> >
> > Yes, shared NUMA mempolicy support could be added to hugetlbfs, with a
> > set_policy method and/or mpol mount option (Andi's first posting did
> > include an admitted-unsatisfactory hugetlb_set_policy()); but it seems
> > that nobody has bothered to add that in the nineteen years since v2.6.7
> > made it possible, and there is at least one company that has invested
> > enough into hugetlbfs, that I guess they have learnt well enough how to
> > manage its NUMA, without needing shared mempolicy.
>
> TBH i'm not sure people in general rely on shared mempolicy. The
> original use case for it was to modify the numa policy of non anonymous
> shared memory files without modifying the program (e.g. Oracle
> database's shared memory segments)

Ah, "without modifying the program": that makes a lot of sense, but
I had never thought of it that way - I just saw it as the right way to
manage the shared object (though an outlier, since we have so many other
msyscall()s which do not manage the underlying shared object in this way).

>
> But I don't think that particular usage model ever got any real
> traction: at leas I haven't seen any real usage of it outside my tests.

If the hugetlbfs support had actually gone in, I imagine Oracle would
have managed it that way; but they seem to have survived well without.

>
> I suspect people either are fine with just process policy or modify the
> program, in which case it's not a big burden to modify every user,
> so process policy or vma based mbind policy works fine.
>
> Maybe it would be an interesting experiment to disable it everywhere
> with some flag and see if anybody complains.
>
> On the other hand it might be Hyrum'ed by now.

This is interesting info, Andi, thank you for providing.

I'm torn. shmem and mempolicy (and struct vm_operations_struct) would
certainly be simpler without shared mempolicy: but I frankly don't have
the time and courage to experiment with deprecating it now; and it is
fundamentally right that such policy should be kept with the object,
not with its mappings. I've assumed for years that it has to stay.

Hugh