2018-08-20 08:57:26

by Oscar Salvador

Subject: [PATCH] mm: Fix comment for NODEMASK_ALLOC

From: Oscar Salvador <[email protected]>

Currently, NODEMASK_ALLOC allocates a nodemask_t with kmalloc when
NODES_SHIFT is higher than 8; otherwise it declares it on the stack.

The comment says that the reasoning behind this is that nodemask_t will be
larger than 256 bytes when NODES_SHIFT is higher than 8, but this is not true.
For example, NODES_SHIFT = 9 gives us a 64-byte nodemask_t.
Let us fix up the comment accordingly.
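
For reference, the two branches being discussed look roughly like this
(paraphrased from include/linux/nodemask.h of that era; exact whitespace
may differ):

#if NODES_SHIFT > 8
#define NODEMASK_ALLOC(type, name, gfp_flags)	\
			type *name = kmalloc(sizeof(*name), gfp_flags)
#define NODEMASK_FREE(m)			kfree(m)
#else	/* small enough to live on the caller's stack */
#define NODEMASK_ALLOC(type, name, gfp_flags) type _##name, *name = &_##name
#define NODEMASK_FREE(m)			do {} while (0)
#endif

sizeof(nodemask_t) is MAX_NUMNODES bits rounded up to unsigned longs, so
NODES_SHIFT = 8 -> 256 bits = 32 bytes, NODES_SHIFT = 9 -> 512 bits = 64
bytes, and NODES_SHIFT = 10 -> 1024 bits = 128 bytes.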

Another thing is that it might make sense to let nodemasks smaller than
128 bytes be allocated on the stack.
Although this all depends on the depth of the stack
(and that varies from function to function), I think that 64 bytes
is something we can easily afford.
So we could even bump the limit by 1 (from > 8 to > 9).

Signed-off-by: Oscar Salvador <[email protected]>
---
include/linux/nodemask.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 1fbde8a880d9..5a30ad594ccc 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -518,7 +518,7 @@ static inline int node_random(const nodemask_t *mask)
  * NODEMASK_ALLOC(type, name) allocates an object with a specified type and
  * name.
  */
-#if NODES_SHIFT > 8 /* nodemask_t > 256 bytes */
+#if NODES_SHIFT > 8 /* nodemask_t > 32 bytes */
 #define NODEMASK_ALLOC(type, name, gfp_flags)	\
 			type *name = kmalloc(sizeof(*name), gfp_flags)
 #define NODEMASK_FREE(m)			kfree(m)
--
2.13.6



2018-08-20 21:26:09

by Andrew Morton

Subject: Re: [PATCH] mm: Fix comment for NODEMASK_ALLOC

On Mon, 20 Aug 2018 10:55:16 +0200 Oscar Salvador <[email protected]> wrote:

> From: Oscar Salvador <[email protected]>
>
> Currently, NODEMASK_ALLOC allocates a nodemask_t with kmalloc when
> NODES_SHIFT is higher than 8, otherwise it declares it within the stack.
>
> The comment says that the reasoning behind this, is that nodemask_t will be
> 256 bytes when NODES_SHIFT is higher than 8, but this is not true.
> For example, NODES_SHIFT = 9 will give us a 64 bytes nodemask_t.
> Let us fix up the comment for that.
>
> Another thing is that it might make sense to let values lower than 128bytes
> be allocated in the stack.
> Although this all depends on the depth of the stack
> (and this changes from function to function), I think that 64 bytes
> is something we can easily afford.
> So we could even bump the limit by 1 (from > 8 to > 9).
>

I agree. Such a change will reduce the amount of testing which the
kmalloc version receives, but I assume there are enough people out
there testing with large NODES_SHIFT values.

And while we're looking at this, it would be nice to make NODES_SHIFT
go away. Ensure that CONFIG_NODES_SHIFT always has a setting and use
that directly.
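
For reference, NODES_SHIFT is currently just a thin fallback around the
Kconfig symbol, roughly (paraphrased from include/linux/numa.h; details
may differ):

#ifdef CONFIG_NODES_SHIFT
#define NODES_SHIFT	CONFIG_NODES_SHIFT
#else
#define NODES_SHIFT	0
#endif

i.e. the indirection only exists for configurations where CONFIG_NODES_SHIFT
may be unset, which is what would need to be guaranteed instead.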



2018-08-21 12:19:02

by Michal Hocko

Subject: Re: [PATCH] mm: Fix comment for NODEMASK_ALLOC

On Mon 20-08-18 14:24:40, Andrew Morton wrote:
> On Mon, 20 Aug 2018 10:55:16 +0200 Oscar Salvador <[email protected]> wrote:
>
> > From: Oscar Salvador <[email protected]>
> >
> > Currently, NODEMASK_ALLOC allocates a nodemask_t with kmalloc when
> > NODES_SHIFT is higher than 8, otherwise it declares it within the stack.
> >
> > The comment says that the reasoning behind this, is that nodemask_t will be
> > 256 bytes when NODES_SHIFT is higher than 8, but this is not true.
> > For example, NODES_SHIFT = 9 will give us a 64 bytes nodemask_t.
> > Let us fix up the comment for that.
> >
> > Another thing is that it might make sense to let values lower than 128bytes
> > be allocated in the stack.
> > Although this all depends on the depth of the stack
> > (and this changes from function to function), I think that 64 bytes
> > is something we can easily afford.
> > So we could even bump the limit by 1 (from > 8 to > 9).
> >
>
> I agree. Such a change will reduce the amount of testing which the
> kmalloc version receives, but I assume there are enough people out
> there testing with large NODES_SHIFT values.

We have had CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
time (since around SLE11-SP3, AFAICS).

Anyway, isn't NODEMASK_ALLOC a bit over-engineered? Does anything actually
even go beyond 1024 NUMA nodes? That would be 128B at most, and from a quick
glance it seems that none of those functions are called in deep stacks. I
haven't gone through all of them, but a patch which checks them all and
removes NODEMASK_ALLOC would be quite nice IMHO.

--
Michal Hocko
SUSE Labs

2018-08-21 12:34:03

by Oscar Salvador

Subject: Re: [PATCH] mm: Fix comment for NODEMASK_ALLOC

On Tue, Aug 21, 2018 at 02:17:34PM +0200, Michal Hocko wrote:
> We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
> time (around SLE11-SP3 AFAICS).
>
> Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
> larger than 1024 NUMA nodes? This would be 128B and from a quick glance
> it seems that none of those functions are called in deep stacks. I
> haven't gone through all of them but a patch which checks them all and
> removes NODES_ALLOC would be quite nice IMHO.

No, the maximum we can get is 1024 NUMA nodes.
I checked this when writing another patch [1]; having gone through all
archs' Kconfigs, CONFIG_NODES_SHIFT=10 is the limit.

NODEMASK_ALLOC only gets called from:

- unregister_mem_sect_under_nodes() (not anymore after [1])
- __nr_hugepages_store_common() (this does not seem to have a deep stack, so we could use a plain nodemask_t; a rough before/after sketch follows below)
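
A rough before/after sketch of that conversion (illustrative only, not a
submitted patch; it assumes the caller's stack is indeed shallow):

	/* before: indirection through NODEMASK_ALLOC()/NODEMASK_FREE() */
	NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
	if (!nodes_allowed)
		return -ENOMEM;
	/* ... fill and use *nodes_allowed ... */
	NODEMASK_FREE(nodes_allowed);

	/* after: a plain nodemask, at most 128 bytes with NODES_SHIFT=10 */
	nodemask_t nodes_allowed = NODE_MASK_NONE;
	/* ... fill and use nodes_allowed directly ... */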

But it is also used by NODEMASK_SCRATCH (mainly used by the mempolicy code):

struct nodemask_scratch {
	nodemask_t mask1;
	nodemask_t mask2;
};

which would make 256 bytes in case CONFIG_NODES_SHIFT=10.
I am not familiar with the mempolicy code, so I am not sure whether we can
do without that and find another way to achieve the same.
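
For completeness, NODEMASK_SCRATCH is just NODEMASK_ALLOC applied to that
two-mask struct, roughly (paraphrased from include/linux/nodemask.h):

#define NODEMASK_SCRATCH(x)						\
			NODEMASK_ALLOC(struct nodemask_scratch, x,	\
					GFP_KERNEL | __GFP_NORETRY)
#define NODEMASK_SCRATCH_FREE(x)	NODEMASK_FREE(x)

so with CONFIG_NODES_SHIFT=10 the kmalloc'ed scratch is 2 * 128 = 256 bytes,
and with an on-stack variant those 256 bytes would land on the caller's
stack instead.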

[1] https://patchwork.kernel.org/patch/10566673/#22179663

--
Oscar Salvador
SUSE L3

2018-08-21 13:01:27

by Oscar Salvador

Subject: Re: [PATCH] mm: Fix comment for NODEMASK_ALLOC

On Tue, Aug 21, 2018 at 02:51:56PM +0200, Michal Hocko wrote:
> On Tue 21-08-18 14:30:24, Oscar Salvador wrote:
> > On Tue, Aug 21, 2018 at 02:17:34PM +0200, Michal Hocko wrote:
> > > We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
> > > time (around SLE11-SP3 AFAICS).
> > >
> > > Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
> > > larger than 1024 NUMA nodes? This would be 128B and from a quick glance
> > > it seems that none of those functions are called in deep stacks. I
> > > haven't gone through all of them but a patch which checks them all and
> > > removes NODES_ALLOC would be quite nice IMHO.
> >
> > No, maximum we can get is 1024 NUMA nodes.
> > I checked this when writing another patch [1], and since having gone
> > through all archs Kconfigs, CONFIG_NODES_SHIFT=10 is the limit.
> >
> > NODEMASK_ALLOC gets only called from:
> >
> > - unregister_mem_sect_under_nodes() (not anymore after [1])
> > - __nr_hugepages_store_common (This does not seem to have a deep stack, we could use a normal nodemask_t)
> >
> > But is also used for NODEMASK_SCRATCH (mainly used for mempolicy):
>
> mempolicy code should be a shallow stack as well. Mostly the syscall
> entry.

Ok, then I could give it a try and see if we can get rid of NODEMASK_ALLOC in there
as well.

--
Oscar Salvador
SUSE L3

2018-08-21 13:32:50

by Michal Hocko

Subject: Re: [PATCH] mm: Fix comment for NODEMASK_ALLOC

On Tue 21-08-18 14:30:24, Oscar Salvador wrote:
> On Tue, Aug 21, 2018 at 02:17:34PM +0200, Michal Hocko wrote:
> > We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
> > time (around SLE11-SP3 AFAICS).
> >
> > Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
> > larger than 1024 NUMA nodes? This would be 128B and from a quick glance
> > it seems that none of those functions are called in deep stacks. I
> > haven't gone through all of them but a patch which checks them all and
> > removes NODES_ALLOC would be quite nice IMHO.
>
> No, maximum we can get is 1024 NUMA nodes.
> I checked this when writing another patch [1], and since having gone
> through all archs Kconfigs, CONFIG_NODES_SHIFT=10 is the limit.
>
> NODEMASK_ALLOC gets only called from:
>
> - unregister_mem_sect_under_nodes() (not anymore after [1])
> - __nr_hugepages_store_common (This does not seem to have a deep stack, we could use a normal nodemask_t)
>
> But is also used for NODEMASK_SCRATCH (mainly used for mempolicy):

mempolicy code should be a shallow stack as well. Mostly the syscall
entry.

--
Michal Hocko
SUSE Labs

2018-08-21 21:02:34

by Andrew Morton

Subject: Re: [PATCH] mm: Fix comment for NODEMASK_ALLOC

On Tue, 21 Aug 2018 14:30:24 +0200 Oscar Salvador <[email protected]> wrote:

> On Tue, Aug 21, 2018 at 02:17:34PM +0200, Michal Hocko wrote:
> > We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
> > time (around SLE11-SP3 AFAICS).
> >
> > Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
> > larger than 1024 NUMA nodes? This would be 128B and from a quick glance
> > it seems that none of those functions are called in deep stacks. I
> > haven't gone through all of them but a patch which checks them all and
> > removes NODES_ALLOC would be quite nice IMHO.
>
> No, maximum we can get is 1024 NUMA nodes.
> I checked this when writing another patch [1], and since having gone
> through all archs Kconfigs, CONFIG_NODES_SHIFT=10 is the limit.
>
> NODEMASK_ALLOC gets only called from:
>
> - unregister_mem_sect_under_nodes() (not anymore after [1])
> - __nr_hugepages_store_common (This does not seem to have a deep stack, we could use a normal nodemask_t)
>
> But is also used for NODEMASK_SCRATCH (mainly used for mempolicy):
>
> struct nodemask_scratch {
> nodemask_t mask1;
> nodemask_t mask2;
> };
>
> that would make 256 bytes in case CONFIG_NODES_SHIFT=10.

And that sole site could use an open-coded kmalloc.



2018-08-23 16:08:03

by Oscar Salvador

Subject: Re: [PATCH] mm: Fix comment for NODEMASK_ALLOC

On Tue, Aug 21, 2018 at 01:51:59PM -0700, Andrew Morton wrote:
> On Tue, 21 Aug 2018 14:30:24 +0200 Oscar Salvador <[email protected]> wrote:
>
> > On Tue, Aug 21, 2018 at 02:17:34PM +0200, Michal Hocko wrote:
> > > We do have CONFIG_NODES_SHIFT=10 in our SLES kernels for quite some
> > > time (around SLE11-SP3 AFAICS).
> > >
> > > Anyway, isn't NODES_ALLOC over engineered a bit? Does actually even do
> > > larger than 1024 NUMA nodes? This would be 128B and from a quick glance
> > > it seems that none of those functions are called in deep stacks. I
> > > haven't gone through all of them but a patch which checks them all and
> > > removes NODES_ALLOC would be quite nice IMHO.
> >
> > No, maximum we can get is 1024 NUMA nodes.
> > I checked this when writing another patch [1], and since having gone
> > through all archs Kconfigs, CONFIG_NODES_SHIFT=10 is the limit.
> >
> > NODEMASK_ALLOC gets only called from:
> >
> > - unregister_mem_sect_under_nodes() (not anymore after [1])
> > - __nr_hugepages_store_common (This does not seem to have a deep stack, we could use a normal nodemask_t)
> >
> > But is also used for NODEMASK_SCRATCH (mainly used for mempolicy):
> >
> > struct nodemask_scratch {
> > nodemask_t mask1;
> > nodemask_t mask2;
> > };
> >
> > that would make 256 bytes in case CONFIG_NODES_SHIFT=10.
>
> And that sole site could use an open-coded kmalloc.

It is not really one single place, but four:

- do_set_mempolicy()
- do_mbind()
- kernel_migrate_pages()
- mpol_shared_policy_init()

They get called from:

- do_set_mempolicy()
    - From the set_mempolicy syscall
    - From numa_policy_init()
    - From numa_default_policy()

  * All of the above do not look like they have a deep stack, so it should
    be possible to get rid of NODEMASK_SCRATCH there.

- do_mbind()
    - From the mbind syscall

  * Should be feasible here as well.

- kernel_migrate_pages()
    - From the migrate_pages syscall

  * Again, this should be doable.

- mpol_shared_policy_init()
    - From hugetlbfs_alloc_inode()
    - From shmem_get_inode()

  * Seems doable for hugetlbfs_alloc_inode() as well.
    I only got to check hugetlbfs_alloc_inode, because shmem_get_inode


So it seems that this can be done in most places.
The only tricky function might be mpol_shared_policy_init() because of shmem_get_inode().
But in that case, we could use an open-coded kmalloc there.
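
A hypothetical sketch of that open-coded variant (illustrative only, not an
actual patch; it mirrors the existing "if (scratch)" pattern and assumes
<linux/slab.h> and <linux/nodemask.h> are already included, as they are in
mm/mempolicy.c):

	struct nodemask_scratch *scratch;

	scratch = kmalloc(sizeof(*scratch), GFP_KERNEL | __GFP_NORETRY);
	if (scratch) {
		/* ... same use of scratch->mask1 / scratch->mask2 as today ... */
	}
	kfree(scratch);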

Thanks
--
Oscar Salvador
SUSE L3