From: Oscar Salvador <[email protected]>
v2 -> v3:
- NODEMASK_FREE can deal with NULL pointers, so do not
make it conditional (by David).
- Split up node_online's check patch (David's suggestion)
- Added Reviewed-by from Andrew and David
- Fix checkpath.pl warnings
This patchset does some cleanups and refactoring in the memory-hotplug code.
The first and the second patch are pretty straightforward, as they
only remove unused arguments/checks.
The third one refactors unregister_mem_sect_under_nodes.
This is needed to have a proper fallback in case we could not allocate
memory. (details can be seen in patch3).
The fourth patch removes a node_online check.
We are getting the nid from pages that are yet not removed, but a node
can only be offline when its memory/cpu's have been removed.
Therefore, we do not really need to check for the node to be online here.
Since this change has a patch for itself, we could quickly revert it
if we notice that something is wrong with it, or drop it if people
are concerned about it.
Oscar Salvador (4):
mm/memory-hotplug: Drop unused args from remove_memory_section
mm/memory_hotplug: Drop mem_blk check from
unregister_mem_sect_under_nodes
mm/memory_hotplug: Refactor unregister_mem_sect_under_nodes
mm/memory_hotplug: Drop node_online check in
unregister_mem_sect_under_nodes
drivers/base/memory.c | 5 ++---
drivers/base/node.c | 29 +++++++++++++++--------------
include/linux/node.h | 5 ++---
3 files changed, 19 insertions(+), 20 deletions(-)
--
2.13.6
From: Oscar Salvador <[email protected]>
unregister_memory_section() calls remove_memory_section()
with three arguments:
* node_id
* section
* phys_device
Neither node_id nor phys_device are used.
Let us drop them from the function.
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
---
drivers/base/memory.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index c8a1cb0b6136..2c622a9a7490 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -752,8 +752,7 @@ unregister_memory(struct memory_block *memory)
device_unregister(&memory->dev);
}
-static int remove_memory_section(unsigned long node_id,
- struct mem_section *section, int phys_device)
+static int remove_memory_section(struct mem_section *section)
{
struct memory_block *mem;
@@ -785,7 +784,7 @@ int unregister_memory_section(struct mem_section *section)
if (!present_section(section))
return -EINVAL;
- return remove_memory_section(0, section, 0);
+ return remove_memory_section(section);
}
#endif /* CONFIG_MEMORY_HOTREMOVE */
--
2.13.6
From: Oscar Salvador <[email protected]>
Before calling to unregister_mem_sect_under_nodes(),
remove_memory_section() already checks if we got a valid memory_block.
No need to check that again in unregister_mem_sect_under_nodes().
If more functions start using unregister_mem_sect_under_nodes() in the
future, we can always place a WARN_ON to catch null mem_blk's so we can
safely back off.
For now, let us keep the check in remove_memory_section() since it is the
only function that uses it.
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
---
drivers/base/node.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 1ac4c36e13bb..dd3bdab230b2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -455,10 +455,6 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
unsigned long pfn, sect_start_pfn, sect_end_pfn;
- if (!mem_blk) {
- NODEMASK_FREE(unlinked_nodes);
- return -EFAULT;
- }
if (!unlinked_nodes)
return -ENOMEM;
nodes_clear(*unlinked_nodes);
--
2.13.6
From: Oscar Salvador <[email protected]>
unregister_mem_sect_under_nodes() tries to allocate a nodemask_t
in order to check whithin the loop which nodes have already been unlinked,
so we do not repeat the operation on them.
NODEMASK_ALLOC calls kmalloc() if NODES_SHIFT > 8, otherwise
it just declares a nodemask_t variable whithin the stack.
Since kamlloc() can fail, we actually check whether NODEMASK_ALLOC failed
or not, and we return -ENOMEM accordingly.
remove_memory_section() does not check for the return value though.
The problem with this is that if we return -ENOMEM, it means that
unregister_mem_sect_under_nodes will not be able to remove the symlinks,
but since we do not check the return value, we go ahead and we call
unregister_memory(), which will remove all the mem_blks directories.
This will leave us with dangled symlinks.
The easiest way to overcome this is to fallback by calling
sysfs_remove_link() unconditionally in case NODEMASK_ALLOC failed.
This means that we will call sysfs_remove_link on nodes that have been
already unlinked, but nothing wrong happens as sysfs_remove_link()
backs off somewhere down the chain in case the link has already been
removed.
I think that this is better than
a) dangled symlinks
b) having to recovery from such error in remove_memory_section
Since from now on we will not need to care about return values, we can make
the function void.
As we have a safe fallback, one thing that could also be done is to add
__GFP_NORETRY in the flags when calling NODEMASK_ALLOC, so we do not retry.
Signed-off-by: Oscar Salvador <[email protected]>
---
drivers/base/node.c | 23 +++++++++++++++--------
include/linux/node.h | 5 ++---
2 files changed, 17 insertions(+), 11 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index dd3bdab230b2..81b27b5b1f15 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -449,35 +449,42 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
}
/* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+void unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
unsigned long phys_index)
{
NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
unsigned long pfn, sect_start_pfn, sect_end_pfn;
- if (!unlinked_nodes)
- return -ENOMEM;
- nodes_clear(*unlinked_nodes);
+ if (unlinked_nodes)
+ nodes_clear(*unlinked_nodes);
sect_start_pfn = section_nr_to_pfn(phys_index);
sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
- int nid;
+ int nid = get_nid_for_pfn(pfn);
- nid = get_nid_for_pfn(pfn);
if (nid < 0)
continue;
if (!node_online(nid))
continue;
- if (node_test_and_set(nid, *unlinked_nodes))
+ /*
+ * It is possible that NODEMASK_ALLOC fails due to memory
+ * pressure.
+ * If that happens, we fallback to call sysfs_remove_link
+ * unconditionally.
+ * Nothing wrong will happen as sysfs_remove_link will back off
+ * somewhere down the chain in case the link has already been
+ * removed.
+ */
+ if (unlinked_nodes && node_test_and_set(nid, *unlinked_nodes))
continue;
+
sysfs_remove_link(&node_devices[nid]->dev.kobj,
kobject_name(&mem_blk->dev.kobj));
sysfs_remove_link(&mem_blk->dev.kobj,
kobject_name(&node_devices[nid]->dev.kobj));
}
NODEMASK_FREE(unlinked_nodes);
- return 0;
}
int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn)
diff --git a/include/linux/node.h b/include/linux/node.h
index 257bb3d6d014..1203378e596a 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -72,7 +72,7 @@ extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
extern int register_mem_sect_under_node(struct memory_block *mem_blk,
void *arg);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+extern void unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
unsigned long phys_index);
#ifdef CONFIG_HUGETLBFS
@@ -105,10 +105,9 @@ static inline int register_mem_sect_under_node(struct memory_block *mem_blk,
{
return 0;
}
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+static inline void unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
unsigned long phys_index)
{
- return 0;
}
static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
--
2.13.6
From: Oscar Salvador <[email protected]>
We are getting the nid from the pages that are not yet removed,
but a node can only be offline when its memory/cpu's have been removed.
Therefore, we know that the node is still online.
Signed-off-by: Oscar Salvador <[email protected]>
---
drivers/base/node.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 81b27b5b1f15..b23769e4fcbb 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -465,8 +465,6 @@ void unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
if (nid < 0)
continue;
- if (!node_online(nid))
- continue;
/*
* It is possible that NODEMASK_ALLOC fails due to memory
* pressure.
--
2.13.6
On Wed, 15 Aug 2018 16:42:18 +0200 Oscar Salvador <[email protected]> wrote:
> From: Oscar Salvador <[email protected]>
>
> unregister_mem_sect_under_nodes() tries to allocate a nodemask_t
> in order to check whithin the loop which nodes have already been unlinked,
> so we do not repeat the operation on them.
>
> NODEMASK_ALLOC calls kmalloc() if NODES_SHIFT > 8, otherwise
> it just declares a nodemask_t variable whithin the stack.
>
> Since kamlloc() can fail, we actually check whether NODEMASK_ALLOC failed
> or not, and we return -ENOMEM accordingly.
> remove_memory_section() does not check for the return value though.
>
> The problem with this is that if we return -ENOMEM, it means that
> unregister_mem_sect_under_nodes will not be able to remove the symlinks,
> but since we do not check the return value, we go ahead and we call
> unregister_memory(), which will remove all the mem_blks directories.
>
> This will leave us with dangled symlinks.
>
> The easiest way to overcome this is to fallback by calling
> sysfs_remove_link() unconditionally in case NODEMASK_ALLOC failed.
> This means that we will call sysfs_remove_link on nodes that have been
> already unlinked, but nothing wrong happens as sysfs_remove_link()
> backs off somewhere down the chain in case the link has already been
> removed.
>
> I think that this is better than
>
> a) dangled symlinks
> b) having to recovery from such error in remove_memory_section
>
> Since from now on we will not need to care about return values, we can make
> the function void.
>
> As we have a safe fallback, one thing that could also be done is to add
> __GFP_NORETRY in the flags when calling NODEMASK_ALLOC, so we do not retry.
>
Oh boy, lots of things.
That small GFP_KERNEL allocation will basically never fail. In the
exceedingly-rare-basically-never-happens case, simply bailing out of
unregister_mem_sect_under_nodes() seems acceptable. But I guess that
addressing it is a reasonable thing to do, if it can be done sanely.
But given that your new unregister_mem_sect_under_nodes() can proceed
happily even if the allocation failed, what's the point in allocating
the nodemask at all? Why not just go ahead and run sysfs_remove_link()
against possibly-absent sysfs objects every time? That produces
simpler code and avoids having this basically-never-executed-or-tested
code path in there.
Incidentally, do we have locking in place to prevent
unregister_mem_sect_under_nodes() from accidentally removing sysfs
nodes which were added 2 nanoseconds ago by a concurrent thread?
Also, this stuff in nodemask.h:
: /*
: * For nodemask scrach area.
: * NODEMASK_ALLOC(type, name) allocates an object with a specified type and
: * name.
: */
: #if NODES_SHIFT > 8 /* nodemask_t > 256 bytes */
: #define NODEMASK_ALLOC(type, name, gfp_flags) \
: type *name = kmalloc(sizeof(*name), gfp_flags)
: #define NODEMASK_FREE(m) kfree(m)
: #else
: #define NODEMASK_ALLOC(type, name, gfp_flags) type _##name, *name = &_##name
: #define NODEMASK_FREE(m) do {} while (0)
: #endif
a) s/scrach/scratch/
b) The comment is wrong, isn't it? "NODES_SHIFT > 8" means
"nodemask_t > 32 bytes"?
c) If "yes" then we can surely bump that up a bit - "NODES_SHIFT >
11", say.
d) What's the maximum number of nodes, ever? Perhaps we can always
fit a nodemask_t onto the stack, dunno.
On Wed, Aug 15, 2018 at 03:01:21PM -0700, Andrew Morton wrote:
> Oh boy, lots of things.
>
> That small GFP_KERNEL allocation will basically never fail. In the
> exceedingly-rare-basically-never-happens case, simply bailing out of
> unregister_mem_sect_under_nodes() seems acceptable. But I guess that
> addressing it is a reasonable thing to do, if it can be done sanely.
Yes, I think this can be fixed as the patch showed.
Currently, if we bail out, we will have dangled symlinks, but we do not
really need to bail out, as we can proceed anyway.
> But given that your new unregister_mem_sect_under_nodes() can proceed
> happily even if the allocation failed, what's the point in allocating
> the nodemask at all? Why not just go ahead and run sysfs_remove_link()
> against possibly-absent sysfs objects every time? That produces
> simpler code and avoids having this basically-never-executed-or-tested
> code path in there.
Unless I am mistaken, the whole point to allocate a nodemask_t there is to
try to perform as less operations as possible.
If we already unlinked a link, let us not call syfs_remove_link again,
although doing it more than once on the same node is not harmful.
I will have a small impact in performance though, as we will repeat
operations.
Of course we can get rid of the nodemask_t and just call syfs_remove_link,
but I wonder if that is a bit suboptimal.
> Incidentally, do we have locking in place to prevent
> unregister_mem_sect_under_nodes() from accidentally removing sysfs
> nodes which were added 2 nanoseconds ago by a concurrent thread?
Well, remove_memory() and add_memory_resource() is being serialized with
mem_hotplug_begin()/mem_hotplug_done().
Since registering node's on mem_blk's is done in add_memory_resource(),
and unregistering them it is done in remove_memory() patch, I think they
cannot step on each other's feet.
Although, I saw that remove_memory_section() takes mem_sysfs_mutex.
I wonder if we should do the same in link_mem_sections(), just to be on the
safe side.
> Also, this stuff in nodemask.h:
>
> : /*
> : * For nodemask scrach area.
> : * NODEMASK_ALLOC(type, name) allocates an object with a specified type and
> : * name.
> : */
> : #if NODES_SHIFT > 8 /* nodemask_t > 256 bytes */
> : #define NODEMASK_ALLOC(type, name, gfp_flags) \
> : type *name = kmalloc(sizeof(*name), gfp_flags)
> : #define NODEMASK_FREE(m) kfree(m)
> : #else
> : #define NODEMASK_ALLOC(type, name, gfp_flags) type _##name, *name = &_##name
> : #define NODEMASK_FREE(m) do {} while (0)
> : #endif
>
> a) s/scrach/scratch/
>
> b) The comment is wrong, isn't it? "NODES_SHIFT > 8" means
> "nodemask_t > 32 bytes"?
It is wrong yes.
For example, if NODES_SHIFT = 9, that makes 64 bytes.
> c) If "yes" then we can surely bump that up a bit - "NODES_SHIFT >
> 11", say.
I checked all architectures that define NODES_SHIFT in Kconfig.
The maximum we can get is NODES_SHIFT = 10, and this makes 128 bytes.
> d) What's the maximum number of nodes, ever? Perhaps we can always
> fit a nodemask_t onto the stack, dunno.
Right now, we define the maximum as NODES_SHIFT = 10, so:
1 << 10 = 1024 Maximum nodes.
Since this makes only 128 bytes, I wonder if we can just go ahead and define a nodemask_t
whithin the stack.
128 bytes is not that much, is it?
Thanks
--
Oscar Salvador
SUSE L3
On 18-08-15 16:42:17, Oscar Salvador wrote:
> From: Oscar Salvador <[email protected]>
>
> Before calling to unregister_mem_sect_under_nodes(),
> remove_memory_section() already checks if we got a valid memory_block.
>
> No need to check that again in unregister_mem_sect_under_nodes().
>
> If more functions start using unregister_mem_sect_under_nodes() in the
> future, we can always place a WARN_ON to catch null mem_blk's so we can
> safely back off.
>
> For now, let us keep the check in remove_memory_section() since it is the
> only function that uses it.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
> ---
> drivers/base/node.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 1ac4c36e13bb..dd3bdab230b2 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -455,10 +455,6 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
> NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
> unsigned long pfn, sect_start_pfn, sect_end_pfn;
>
> - if (!mem_blk) {
> - NODEMASK_FREE(unlinked_nodes);
> - return -EFAULT;
> - }
> if (!unlinked_nodes)
> return -ENOMEM;
> nodes_clear(*unlinked_nodes);
> --
> 2.13.6
>
> > d) What's the maximum number of nodes, ever? Perhaps we can always
> > fit a nodemask_t onto the stack, dunno.
>
> Right now, we define the maximum as NODES_SHIFT = 10, so:
>
> 1 << 10 = 1024 Maximum nodes.
>
> Since this makes only 128 bytes, I wonder if we can just go ahead and define a nodemask_t
> whithin the stack.
> 128 bytes is not that much, is it?
Yeah, sue stack here, 128b is tiny. This also will solve Andrew's point of having an untested path when alloc fails, and simplify the patch overall.
Thank you,
Pavel
On 18-08-15 16:42:16, Oscar Salvador wrote:
> From: Oscar Salvador <[email protected]>
>
> unregister_memory_section() calls remove_memory_section()
> with three arguments:
>
> * node_id
> * section
> * phys_device
>
> Neither node_id nor phys_device are used.
> Let us drop them from the function.
Looks good:
Reviewed-by: Pavel Tatashin <[email protected]>
>
> Signed-off-by: Oscar Salvador <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Reviewed-by: Andrew Morton <[email protected]>
> ---
> drivers/base/memory.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index c8a1cb0b6136..2c622a9a7490 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -752,8 +752,7 @@ unregister_memory(struct memory_block *memory)
> device_unregister(&memory->dev);
> }
>
> -static int remove_memory_section(unsigned long node_id,
> - struct mem_section *section, int phys_device)
> +static int remove_memory_section(struct mem_section *section)
> {
> struct memory_block *mem;
>
> @@ -785,7 +784,7 @@ int unregister_memory_section(struct mem_section *section)
> if (!present_section(section))
> return -EINVAL;
>
> - return remove_memory_section(0, section, 0);
> + return remove_memory_section(section);
> }
> #endif /* CONFIG_MEMORY_HOTREMOVE */
>
> --
> 2.13.6
>
On 15.08.2018 16:42, Oscar Salvador wrote:
> From: Oscar Salvador <[email protected]>
>
> We are getting the nid from the pages that are not yet removed,
> but a node can only be offline when its memory/cpu's have been removed.
> Therefore, we know that the node is still online.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> drivers/base/node.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 81b27b5b1f15..b23769e4fcbb 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -465,8 +465,6 @@ void unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>
> if (nid < 0)
> continue;
> - if (!node_online(nid))
> - continue;
> /*
> * It is possible that NODEMASK_ALLOC fails due to memory
> * pressure.
>
Sounds reasonable to me
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On 15.08.2018 16:42, Oscar Salvador wrote:
> From: Oscar Salvador <[email protected]>
>
> Before calling to unregister_mem_sect_under_nodes(),
> remove_memory_section() already checks if we got a valid memory_block.
>
> No need to check that again in unregister_mem_sect_under_nodes().
>
> If more functions start using unregister_mem_sect_under_nodes() in the
> future, we can always place a WARN_ON to catch null mem_blk's so we can
> safely back off.
>
> For now, let us keep the check in remove_memory_section() since it is the
> only function that uses it.
>
> Signed-off-by: Oscar Salvador <[email protected]>
> Reviewed-by: Andrew Morton <[email protected]>
> ---
> drivers/base/node.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 1ac4c36e13bb..dd3bdab230b2 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -455,10 +455,6 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
> NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
> unsigned long pfn, sect_start_pfn, sect_end_pfn;
>
> - if (!mem_blk) {
> - NODEMASK_FREE(unlinked_nodes);
> - return -EFAULT;
> - }
> if (!unlinked_nodes)
> return -ENOMEM;
> nodes_clear(*unlinked_nodes);
>
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On 18-08-15 16:42:19, Oscar Salvador wrote:
> From: Oscar Salvador <[email protected]>
>
> We are getting the nid from the pages that are not yet removed,
> but a node can only be offline when its memory/cpu's have been removed.
> Therefore, we know that the node is still online.
Reviewed-by: Pavel Tatashin <[email protected]>
>
> Signed-off-by: Oscar Salvador <[email protected]>
> ---
> drivers/base/node.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 81b27b5b1f15..b23769e4fcbb 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -465,8 +465,6 @@ void unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>
> if (nid < 0)
> continue;
> - if (!node_online(nid))
> - continue;
> /*
> * It is possible that NODEMASK_ALLOC fails due to memory
> * pressure.
> --
> 2.13.6
>