Should address all earlier comments (except for the funny cpuset
case, which I chose to declare a 'don't do that').
Also this time hopefully without missing patches.
There are still some other issues with memory hotadd, but that's the
current slab set.
The patches are against 2.6.32, but apply to mainline I believe.
-Andi
No functional changes.
Signed-off-by: Andi Kleen <[email protected]>
---
mm/slab.c | 34 +++++++++++++++++++++-------------
1 file changed, 21 insertions(+), 13 deletions(-)
Index: linux-2.6.32-memhotadd/mm/slab.c
===================================================================
--- linux-2.6.32-memhotadd.orig/mm/slab.c
+++ linux-2.6.32-memhotadd/mm/slab.c
@@ -1158,19 +1158,9 @@ free_array_cache:
}
}
-static int __cpuinit cpuup_prepare(long cpu)
+static int slab_node_prepare(int node)
{
struct kmem_cache *cachep;
- struct kmem_list3 *l3 = NULL;
- int node = cpu_to_node(cpu);
- const int memsize = sizeof(struct kmem_list3);
-
- /*
- * We need to do this right in the beginning since
- * alloc_arraycache's are going to use this list.
- * kmalloc_node allows us to add the slab to the right
- * kmem_list3 and not this cpu's kmem_list3
- */
list_for_each_entry(cachep, &cache_chain, next) {
/*
@@ -1179,9 +1169,10 @@ static int __cpuinit cpuup_prepare(long
* node has not already allocated this
*/
if (!cachep->nodelists[node]) {
- l3 = kmalloc_node(memsize, GFP_KERNEL, node);
+ struct kmem_list3 *l3;
+ l3 = kmalloc_node(sizeof(struct kmem_list3), GFP_KERNEL, node);
if (!l3)
- goto bad;
+ return -1;
kmem_list3_init(l3);
l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
((unsigned long)cachep) % REAPTIMEOUT_LIST3;
@@ -1200,6 +1191,23 @@ static int __cpuinit cpuup_prepare(long
cachep->batchcount + cachep->num;
spin_unlock_irq(&cachep->nodelists[node]->list_lock);
}
+ return 0;
+}
+
+static int __cpuinit cpuup_prepare(long cpu)
+{
+ struct kmem_cache *cachep;
+ struct kmem_list3 *l3 = NULL;
+ int node = cpu_to_node(cpu);
+
+ /*
+ * We need to do this right in the beginning since
+ * alloc_arraycache's are going to use this list.
+ * kmalloc_node allows us to add the slab to the right
+ * kmem_list3 and not this cpu's kmem_list3
+ */
+ if (slab_node_prepare(node) < 0)
+ goto bad;
/*
* Now we can go ahead with allocating the shared arrays and
So kmalloc_node() works even if no CPU is up yet on the new node.
v2: Take cache chain mutex
Signed-off-by: Andi Kleen <[email protected]>
---
mm/slab.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
Index: linux-2.6.32-memhotadd/mm/slab.c
===================================================================
--- linux-2.6.32-memhotadd.orig/mm/slab.c
+++ linux-2.6.32-memhotadd/mm/slab.c
@@ -115,6 +115,7 @@
#include <linux/reciprocal_div.h>
#include <linux/debugobjects.h>
#include <linux/kmemcheck.h>
+#include <linux/memory.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
@@ -1554,6 +1555,23 @@ void __init kmem_cache_init(void)
g_cpucache_up = EARLY;
}
+static int slab_memory_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ struct memory_notify *mn = (struct memory_notify *)arg;
+
+ /*
+ * When a node goes online allocate l3s early. This way
+ * kmalloc_node() works for it.
+ */
+ if (action == MEM_ONLINE && mn->status_change_nid >= 0) {
+ mutex_lock(&cache_chain_mutex);
+ slab_node_prepare(mn->status_change_nid);
+ mutex_unlock(&cache_chain_mutex);
+ }
+ return NOTIFY_OK;
+}
+
void __init kmem_cache_init_late(void)
{
struct kmem_cache *cachep;
@@ -1577,6 +1595,8 @@ void __init kmem_cache_init_late(void)
*/
register_cpu_notifier(&cpucache_notifier);
+ hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
+
/*
* The reap timers are started later, with a module init call: That part
* of the kernel is not yet operational.
cache_reap can run before the node is set up and then reference a NULL
l3 list. Check for this explicitly and just continue. The node
will eventually be set up.
Signed-off-by: Andi Kleen <[email protected]>
---
mm/slab.c | 3 +++
1 file changed, 3 insertions(+)
Index: linux-2.6.32-memhotadd/mm/slab.c
===================================================================
--- linux-2.6.32-memhotadd.orig/mm/slab.c
+++ linux-2.6.32-memhotadd/mm/slab.c
@@ -4093,6 +4093,9 @@ static void cache_reap(struct work_struc
* we can do some work if the lock was obtained.
*/
l3 = searchp->nodelists[node];
+ /* Node not yet set up */
+ if (!l3)
+ break;
reap_alien(searchp, l3);
When fallback_alloc() runs the node of the CPU might not be initialized yet.
Handle this case by allocating in another node.
v2: Try to allocate from all nodes (David Rientjes)
Signed-off-by: Andi Kleen <[email protected]>
---
mm/slab.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
Index: linux-2.6.32-memhotadd/mm/slab.c
===================================================================
--- linux-2.6.32-memhotadd.orig/mm/slab.c
+++ linux-2.6.32-memhotadd/mm/slab.c
@@ -3188,7 +3188,24 @@ retry:
if (local_flags & __GFP_WAIT)
local_irq_enable();
kmem_flagcheck(cache, flags);
- obj = kmem_getpages(cache, local_flags, numa_node_id());
+
+ /*
+ * Node not set up yet? Try one that the cache has been set up
+ * for.
+ */
+ nid = numa_node_id();
+ if (cache->nodelists[nid] == NULL) {
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+ nid = zone_to_nid(zone);
+ if (cache->nodelists[nid]) {
+ obj = kmem_getpages(cache, local_flags, nid);
+ if (obj)
+ break;
+ }
+ }
+ } else
+ obj = kmem_getpages(cache, local_flags, nid);
+
if (local_flags & __GFP_WAIT)
local_irq_disable();
if (obj) {
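The control flow of the hunk above can be illustrated with a standalone sketch (plain C, not kernel code; the node table and the allocator stub are invented for illustration, with `node_ready[]` standing in for `cache->nodelists[]` and `get_pages()` for `kmem_getpages()`):

```c
#define MAX_NODES 4

/* Stand-in for cache->nodelists[nid]: nonzero means the node's
 * kmem_list3 has been set up. Node 0 and 3 are still coming online. */
static int node_ready[MAX_NODES] = { 0, 1, 1, 0 };

/* Stand-in for kmem_getpages(): returns the node it allocated from,
 * or -1 on failure. Here it only succeeds on set-up nodes. */
static int get_pages(int nid)
{
	return node_ready[nid] ? nid : -1;
}

/* Mirrors the patch: try the local node first; if its list is not
 * initialized yet (memory hotadd in progress), walk the other nodes
 * in order and take the first allocation that succeeds. */
static int fallback_get_pages(int local_nid)
{
	int nid;

	if (node_ready[local_nid])
		return get_pages(local_nid);

	for (nid = 0; nid < MAX_NODES; nid++) {
		if (node_ready[nid]) {
			int obj = get_pages(nid);
			if (obj >= 0)
				return obj;
		}
	}
	return -1;
}
```

With node 0 still being onlined, an allocation local to node 0 falls back to node 1; allocations on already set-up nodes are unaffected.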
On Thu, 11 Feb 2010, Andi Kleen wrote:
> When fallback_alloc() runs the node of the CPU might not be initialized yet.
> Handle this case by allocating in another node.
>
> v2: Try to allocate from all nodes (David Rientjes)
>
You don't need to specifically address the cpuset restriction in
fallback_alloc() since kmem_getpages() will return NULL whenever a zone is
tried from an unallowed node, I just thought it was a faster optimization
considering you (i) would operate over a nodemask and not the entire
zonelist, (ii) it would avoid the zone_to_nid() for all zones since you
already did a zonelist iteration in this function, and (iii) it wouldn't
needlessly call kmem_getpages() for unallowed nodes.
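The difference David describes can be shown with a toy standalone sketch (plain C; the allowed-node bitmask and zone-to-node table are made up): iterating a nodemask visits each candidate node once, while iterating the zonelist visits every zone and performs a `zone_to_nid()`-style lookup per zone.

```c
#define MAX_NODES 8

/* Hypothetical cpuset-allowed nodes: 1 and 2. */
static unsigned int allowed_nodes = (1u << 1) | (1u << 2);

/* Hypothetical zonelist: each of four nodes contributes two zones. */
static int zone_node[] = { 0, 0, 1, 1, 2, 2, 3, 3 };
#define NR_ZONES ((int)(sizeof(zone_node) / sizeof(zone_node[0])))

/* (i) Nodemask iteration: one candidate visit per allowed node. */
static int visits_nodemask(void)
{
	int nid, visits = 0;

	for (nid = 0; nid < MAX_NODES; nid++)
		if (allowed_nodes & (1u << nid))
			visits++;
	return visits;
}

/* (ii)/(iii) Zonelist iteration: every zone is walked and mapped
 * back to its node, so allowed nodes are visited once per zone and
 * unallowed zones are examined too. */
static int visits_zonelist(void)
{
	int z, visits = 0;

	for (z = 0; z < NR_ZONES; z++) {
		int nid = zone_node[z];	/* the per-zone zone_to_nid() */

		if (allowed_nodes & (1u << nid))
			visits++;
	}
	return visits;
}
```

In this sketch the nodemask walk tries 2 candidates where the zonelist walk tries 4, which is the optimization being declined as unnecessary for the hotadd path.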
> Signed-off-by: Andi Kleen <[email protected]>
That said, I don't want to see this fix go unmerged since you already
declined to make that optimization once:
Acked-by: David Rientjes <[email protected]>
On Thu, 11 Feb 2010, Andi Kleen wrote:
> Index: linux-2.6.32-memhotadd/mm/slab.c
> ===================================================================
> --- linux-2.6.32-memhotadd.orig/mm/slab.c
> +++ linux-2.6.32-memhotadd/mm/slab.c
> @@ -1158,19 +1158,9 @@ free_array_cache:
> }
> }
>
> -static int __cpuinit cpuup_prepare(long cpu)
> +static int slab_node_prepare(int node)
I still think this deserves a comment saying that slab_node_prepare()
requires cache_chain_mutex; I'm not sure people interested in node hotadd
would be concerned with whether the implementation needs to iterate slab
caches or not.
Otherwise:
Acked-by: David Rientjes <[email protected]>
On Thu, 11 Feb 2010, Andi Kleen wrote:
> Index: linux-2.6.32-memhotadd/mm/slab.c
> ===================================================================
> --- linux-2.6.32-memhotadd.orig/mm/slab.c
> +++ linux-2.6.32-memhotadd/mm/slab.c
> @@ -115,6 +115,7 @@
> #include <linux/reciprocal_div.h>
> #include <linux/debugobjects.h>
> #include <linux/kmemcheck.h>
> +#include <linux/memory.h>
>
> #include <asm/cacheflush.h>
> #include <asm/tlbflush.h>
> @@ -1554,6 +1555,23 @@ void __init kmem_cache_init(void)
> g_cpucache_up = EARLY;
> }
>
> +static int slab_memory_callback(struct notifier_block *self,
> + unsigned long action, void *arg)
> +{
> + struct memory_notify *mn = (struct memory_notify *)arg;
> +
> + /*
> + * When a node goes online allocate l3s early. This way
> + * kmalloc_node() works for it.
> + */
> + if (action == MEM_ONLINE && mn->status_change_nid >= 0) {
> + mutex_lock(&cache_chain_mutex);
> + slab_node_prepare(mn->status_change_nid);
> + mutex_unlock(&cache_chain_mutex);
> + }
> + return NOTIFY_OK;
> +}
> +
> void __init kmem_cache_init_late(void)
> {
> struct kmem_cache *cachep;
> @@ -1577,6 +1595,8 @@ void __init kmem_cache_init_late(void)
> */
> register_cpu_notifier(&cpucache_notifier);
>
> + hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
> +
Only needed for CONFIG_NUMA, but there's no side-effects for UMA kernels
since status_change_nid will always be -1.
Acked-by: David Rientjes <[email protected]>
On Thu, 11 Feb 2010, Andi Kleen wrote:
>
> cache_reap can run before the node is set up and then reference a NULL
> l3 list. Check for this explicitly and just continue. The node
> will eventually be set up.
>
> Signed-off-by: Andi Kleen <[email protected]>
Acked-by: David Rientjes <[email protected]>
On Thu, Feb 11, 2010 at 01:41:53PM -0800, David Rientjes wrote:
> On Thu, 11 Feb 2010, Andi Kleen wrote:
>
> > When fallback_alloc() runs the node of the CPU might not be initialized yet.
> > Handle this case by allocating in another node.
> >
> > v2: Try to allocate from all nodes (David Rientjes)
> >
>
> You don't need to specifically address the cpuset restriction in
> fallback_alloc() since kmem_getpages() will return NULL whenever a zone is
> tried from an unallowed node, I just thought it was a faster optimization
> considering you (i) would operate over a nodemask and not the entire
> zonelist, (ii) it would avoid the zone_to_nid() for all zones since you
> already did a zonelist iteration in this function, and (iii) it wouldn't
> needlessly call kmem_getpages() for unallowed nodes.
Thanks for the review again.
I don't really care about performance at all for this, this is just for
a few allocations during the memory hotadd path.
-Andi
--
[email protected] -- Speaking for myself only.
Andi Kleen wrote:
> Should address all earlier comments (except for the funny cpuset
> case which I chose to declare a don't do that)
>
> Also this time hopefully without missing patches.
>
> There are still some other issues with memory hotadd, but that's the
> current slab set.
>
> The patches are against 2.6.32, but apply to mainline I believe.
The series has been applied and will appear in the next version of
linux-next.
On Thu, Feb 11, 2010 at 09:54:00PM +0100, Andi Kleen wrote:
>
> When fallback_alloc() runs the node of the CPU might not be initialized yet.
> Handle this case by allocating in another node.
>
> v2: Try to allocate from all nodes (David Rientjes)
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/slab.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.32-memhotadd/mm/slab.c
> ===================================================================
> --- linux-2.6.32-memhotadd.orig/mm/slab.c
> +++ linux-2.6.32-memhotadd/mm/slab.c
> @@ -3188,7 +3188,24 @@ retry:
> if (local_flags & __GFP_WAIT)
> local_irq_enable();
> kmem_flagcheck(cache, flags);
> - obj = kmem_getpages(cache, local_flags, numa_node_id());
> +
> + /*
> + * Node not set up yet? Try one that the cache has been set up
> + * for.
> + */
> + nid = numa_node_id();
> + if (cache->nodelists[nid] == NULL) {
> + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> + nid = zone_to_nid(zone);
> + if (cache->nodelists[nid]) {
> + obj = kmem_getpages(cache, local_flags, nid);
> + if (obj)
> + break;
> + }
> + }
> + } else
> + obj = kmem_getpages(cache, local_flags, nid);
> +
> if (local_flags & __GFP_WAIT)
> local_irq_disable();
> if (obj) {
This is a better way to go anyway because it really is a proper
"fallback" alloc. I think that possibly used to work (ie. kmem_getpages
would be able to pass -1 for the node there) but got broken along the
line.
Although it's not such a hot path to begin with, care to put a branch
annotation there?
Acked-by: Nick Piggin <[email protected]>
On Thu, Feb 11, 2010 at 01:45:16PM -0800, David Rientjes wrote:
> On Thu, 11 Feb 2010, Andi Kleen wrote:
>
> > Index: linux-2.6.32-memhotadd/mm/slab.c
> > ===================================================================
> > --- linux-2.6.32-memhotadd.orig/mm/slab.c
> > +++ linux-2.6.32-memhotadd/mm/slab.c
> > @@ -115,6 +115,7 @@
> > #include <linux/reciprocal_div.h>
> > #include <linux/debugobjects.h>
> > #include <linux/kmemcheck.h>
> > +#include <linux/memory.h>
> >
> > #include <asm/cacheflush.h>
> > #include <asm/tlbflush.h>
> > @@ -1554,6 +1555,23 @@ void __init kmem_cache_init(void)
> > g_cpucache_up = EARLY;
> > }
> >
> > +static int slab_memory_callback(struct notifier_block *self,
> > + unsigned long action, void *arg)
> > +{
> > + struct memory_notify *mn = (struct memory_notify *)arg;
> > +
> > + /*
> > + * When a node goes online allocate l3s early. This way
> > + * kmalloc_node() works for it.
> > + */
> > + if (action == MEM_ONLINE && mn->status_change_nid >= 0) {
> > + mutex_lock(&cache_chain_mutex);
> > + slab_node_prepare(mn->status_change_nid);
> > + mutex_unlock(&cache_chain_mutex);
> > + }
> > + return NOTIFY_OK;
> > +}
> > +
> > void __init kmem_cache_init_late(void)
> > {
> > struct kmem_cache *cachep;
> > @@ -1577,6 +1595,8 @@ void __init kmem_cache_init_late(void)
> > */
> > register_cpu_notifier(&cpucache_notifier);
> >
> > + hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
> > +
>
> Only needed for CONFIG_NUMA, but there's no side-effects for UMA kernels
> since status_change_nid will always be -1.
Compiler doesn't know that, though.
>
> Acked-by: David Rientjes <[email protected]>
On Thu, Feb 11, 2010 at 09:54:04PM +0100, Andi Kleen wrote:
>
> cache_reap can run before the node is set up and then reference a NULL
> l3 list. Check for this explicitly and just continue. The node
> will eventually be set up.
How, may I ask? cpuup_prepare in the hotplug notifier should always
run before start_cpu_timer.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/slab.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> Index: linux-2.6.32-memhotadd/mm/slab.c
> ===================================================================
> --- linux-2.6.32-memhotadd.orig/mm/slab.c
> +++ linux-2.6.32-memhotadd/mm/slab.c
> @@ -4093,6 +4093,9 @@ static void cache_reap(struct work_struc
> * we can do some work if the lock was obtained.
> */
> l3 = searchp->nodelists[node];
> + /* Node not yet set up */
> + if (!l3)
> + break;
>
> reap_alien(searchp, l3);
>
>
> This is a better way to go anyway because it really is a proper
> "fallback" alloc. I think that possibly used to work (ie. kmem_getpages
> would be able to pass -1 for the node there) but got broken along the
> line.
Thanks for the review.
I should add there's still one open problem: in some cases
the oom killer kicks in on hotadd. Still working on that one.
In general hotadd was mighty bitrotted :/
>
> Although it's not such a hot path to begin with, care to put a branch
> annotation there?
pointer == NULL is already predicted unlikely by default in gcc:
/* Pointers are usually not NULL. */
DEF_PREDICTOR (PRED_POINTER, "pointer", HITRATE (85), 0)
DEF_PREDICTOR (PRED_TREE_POINTER, "pointer (on trees)", HITRATE (85), 0)
-Andi
--
[email protected] -- Speaking for myself only.
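For reference, the annotation Nick is asking about boils down to __builtin_expect; a minimal standalone sketch (the kernel's likely()/unlikely() macros reproduced here so it compiles outside the tree, with a made-up reap_node() helper shaped like the check under discussion):

```c
#include <stddef.h>

/* As defined in the kernel's <linux/compiler.h>. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Shape of the check under discussion: a NULL l3 is the rare
 * memory-hotadd race, so the branch is annotated as unlikely. */
static int reap_node(const int *l3)
{
	if (unlikely(l3 == NULL))
		return -1;	/* node not yet set up; skip it */
	return *l3;
}
```

The hint only biases code layout and static prediction; the function behaves identically with or without it, which is why the disagreement here is purely about annotation style.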
On Mon, Feb 15, 2010 at 11:07:12AM +0100, Andi Kleen wrote:
> > This is a better way to go anyway because it really is a proper
> > "fallback" alloc. I think that possibly used to work (ie. kmem_getpages
> > would be able to pass -1 for the node there) but got broken along the
> > line.
>
> Thanks for the review.
>
> I should add there's still one open problem: in some cases
> the oom killer kicks in on hotadd. Still working on that one.
>
> In general hotadd was mighty bitrotted :/
Yes, that doesn't surprise me. I'm sure you can handle it, but send
some traces if you have problems.
> > Although it's not such a hot path to begin with, care to put a branch
> > annotation there?
>
> pointer == NULL is already predicted unlikely by default in gcc:
>
> /* Pointers are usually not NULL. */
> DEF_PREDICTOR (PRED_POINTER, "pointer", HITRATE (85), 0)
> DEF_PREDICTOR (PRED_TREE_POINTER, "pointer (on trees)", HITRATE (85), 0)
Well, I still prefer to annotate it. I think __builtin_expect is 99%.
On Mon, Feb 15, 2010 at 05:15:35PM +1100, Nick Piggin wrote:
> On Thu, Feb 11, 2010 at 09:54:04PM +0100, Andi Kleen wrote:
> >
> > cache_reap can run before the node is set up and then reference a NULL
> > l3 list. Check for this explicitly and just continue. The node
> > will eventually be set up.
>
> How, may I ask? cpuup_prepare in the hotplug notifier should always
> run before start_cpu_timer.
I'm not fully sure, but I have the oops to prove it :)
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Feb 15, 2010 at 11:32:50AM +0100, Andi Kleen wrote:
> On Mon, Feb 15, 2010 at 05:15:35PM +1100, Nick Piggin wrote:
> > On Thu, Feb 11, 2010 at 09:54:04PM +0100, Andi Kleen wrote:
> > >
> > > cache_reap can run before the node is set up and then reference a NULL
> > > l3 list. Check for this explicitly and just continue. The node
> > > will eventually be set up.
> >
> > How, may I ask? cpuup_prepare in the hotplug notifier should always
> > run before start_cpu_timer.
>
> I'm not fully sure, but I have the oops to prove it :)
Hmm, it would be nice to work out why it's happening. If it's completely
reproducible then could I send you a debug patch to test?
On Mon, Feb 15, 2010 at 09:41:35PM +1100, Nick Piggin wrote:
> On Mon, Feb 15, 2010 at 11:32:50AM +0100, Andi Kleen wrote:
> > On Mon, Feb 15, 2010 at 05:15:35PM +1100, Nick Piggin wrote:
> > > On Thu, Feb 11, 2010 at 09:54:04PM +0100, Andi Kleen wrote:
> > > >
> > > > cache_reap can run before the node is set up and then reference a NULL
> > > > l3 list. Check for this explicitly and just continue. The node
> > > > will eventually be set up.
> > >
> > > How, may I ask? cpuup_prepare in the hotplug notifier should always
> > > run before start_cpu_timer.
> >
> > I'm not fully sure, but I have the oops to prove it :)
>
> Hmm, it would be nice to work out why it's happening. If it's completely
> reproducible then could I send you a debug patch to test?
Looking at it again I suspect it happened this way:
cpuup_prepare fails (e.g. kmalloc_node returns NULL). The later
patches might have cured that. Nothing stops the timer from
starting in this case anyway.
So given that, this patch might not be needed, but it's safer
to have anyway.
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Feb 15, 2010 at 11:52:53AM +0100, Andi Kleen wrote:
> On Mon, Feb 15, 2010 at 09:41:35PM +1100, Nick Piggin wrote:
> > On Mon, Feb 15, 2010 at 11:32:50AM +0100, Andi Kleen wrote:
> > > On Mon, Feb 15, 2010 at 05:15:35PM +1100, Nick Piggin wrote:
> > > > On Thu, Feb 11, 2010 at 09:54:04PM +0100, Andi Kleen wrote:
> > > > >
> > > > > cache_reap can run before the node is set up and then reference a NULL
> > > > > l3 list. Check for this explicitly and just continue. The node
> > > > > will eventually be set up.
> > > >
> > > > How, may I ask? cpuup_prepare in the hotplug notifier should always
> > > > run before start_cpu_timer.
> > >
> > > I'm not fully sure, but I have the oops to prove it :)
> >
> > Hmm, it would be nice to work out why it's happening. If it's completely
> > reproducible then could I send you a debug patch to test?
>
> Looking at it again I suspect it happened this way:
>
> cpuup_prepare fails (e.g. kmalloc_node returns NULL). The later
> patches might have cured that. Nothing stops the timer from
> starting in this case anyway.
Hmm, but it should, because if cpuup_prepare fails then the
CPU_ONLINE notifiers should never be called I think.
> So given that, this patch might not be needed, but it's
> safer to have anyway.
I'm just worried there is still an underlying problem here.
Nick Piggin <[email protected]> writes:
>
> Hmm, but it should, because if cpuup_prepare fails then the
> CPU_ONLINE notifiers should never be called I think.
That's true.
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, 15 Feb 2010, Nick Piggin wrote:
> > > @@ -1577,6 +1595,8 @@ void __init kmem_cache_init_late(void)
> > > */
> > > register_cpu_notifier(&cpucache_notifier);
> > >
> > > + hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
> > > +
> >
> > Only needed for CONFIG_NUMA, but there's no side-effects for UMA kernels
> > since status_change_nid will always be -1.
>
> Compiler doesn't know that, though.
>
Right, setting up a memory hotplug callback for UMA kernels here isn't
necessary although slab_node_prepare() would have to be defined
unconditionally. I made this suggestion in my review of the patchset's
initial version but it was left unchanged, so I'd rather see it included
than otherwise stall out. This could always be enclosed in
#ifdef CONFIG_NUMA later just like the callback in slub does.
On Mon, Feb 15, 2010 at 01:47:29PM -0800, David Rientjes wrote:
> On Mon, 15 Feb 2010, Nick Piggin wrote:
>
> > > > @@ -1577,6 +1595,8 @@ void __init kmem_cache_init_late(void)
> > > > */
> > > > register_cpu_notifier(&cpucache_notifier);
> > > >
> > > > + hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
> > > > +
> > >
> > > Only needed for CONFIG_NUMA, but there's no side-effects for UMA kernels
> > > since status_change_nid will always be -1.
> >
> > Compiler doesn't know that, though.
> >
>
> Right, setting up a memory hotplug callback for UMA kernels here isn't
> necessary although slab_node_prepare() would have to be defined
> unconditionally. I made this suggestion in my review of the patchset's
> initial version but it was left unchanged, so I'd rather see it included
> than otherwise stall out. This could always be enclosed in
> #ifdef CONFIG_NUMA later just like the callback in slub does.
It's not such a big burden to annotate critical core code with such
things. Otherwise someone else ends up eventually doing it.
Nick Piggin wrote:
> On Mon, Feb 15, 2010 at 01:47:29PM -0800, David Rientjes wrote:
>> On Mon, 15 Feb 2010, Nick Piggin wrote:
>>
>>>>> @@ -1577,6 +1595,8 @@ void __init kmem_cache_init_late(void)
>>>>> */
>>>>> register_cpu_notifier(&cpucache_notifier);
>>>>>
>>>>> + hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
>>>>> +
>>>> Only needed for CONFIG_NUMA, but there's no side-effects for UMA kernels
>>>> since status_change_nid will always be -1.
>>> Compiler doesn't know that, though.
>>>
>> Right, setting up a memory hotplug callback for UMA kernels here isn't
>> necessary although slab_node_prepare() would have to be defined
>> unconditionally. I made this suggestion in my review of the patchset's
>> initial version but it was left unchanged, so I'd rather see it included
>> than otherwise stall out. This could always be enclosed in
>> #ifdef CONFIG_NUMA later just like the callback in slub does.
>
> It's not such a big burden to annotate critical core code with such
> things. Otherwise someone else ends up eventually doing it.
Yes, please.
On Mon, 15 Feb 2010, Andi Kleen wrote:
> > How, may I ask? cpuup_prepare in the hotplug notifier should always
> > run before start_cpu_timer.
>
> I'm not fully sure, but I have the oops to prove it :)
I still suspect that this has something to do with Pekka's changing the
boot order for allocator bootstrap. Can we clarify why these problems
exist before we try band aid?
On Mon, 15 Feb 2010, Nick Piggin wrote:
> I'm just worried there is still an underlying problem here.
So am I. What caused the breakage that requires this patchset?
On Fri, Feb 19, 2010 at 12:22:58PM -0600, Christoph Lameter wrote:
> On Mon, 15 Feb 2010, Nick Piggin wrote:
>
> > I'm just worried there is still an underlying problem here.
>
> So am I. What caused the breakage that requires this patchset?
Memory hotadd with a new node being onlined.
-Andi
--
[email protected] -- Speaking for myself only.
Andi Kleen wrote:
> On Fri, Feb 19, 2010 at 12:22:58PM -0600, Christoph Lameter wrote:
>> On Mon, 15 Feb 2010, Nick Piggin wrote:
>>
>>> I'm just worried there is still an underlying problem here.
>> So am I. What caused the breakage that requires this patchset?
>
> Memory hotadd with a new node being onlined.
So can you post the oops, please? Right now I am looking at zapping the
series from slab.git due to NAKs from both Christoph and Nick.
Pekka
Christoph Lameter wrote:
> On Mon, 15 Feb 2010, Andi Kleen wrote:
>
>>> How, may I ask? cpuup_prepare in the hotplug notifier should always
>>> run before start_cpu_timer.
>> I'm not fully sure, but I have the oops to prove it :)
>
> I still suspect that this has something to do with Pekka's changing the
> boot order for allocator bootstrap. Can we clarify why these problems
> exist before we try band aid?
I don't see how my changes broke things but maybe I'm not looking hard
enough. Cache reaping is still set up from cpucache_init(), which is an
initcall that is not affected by my changes AFAICT, and from
cpuup_callback(), which should also not be affected.
Pekka
On Mon, Feb 22, 2010 at 12:53:27PM +0200, Pekka Enberg wrote:
> Andi Kleen wrote:
>> On Fri, Feb 19, 2010 at 12:22:58PM -0600, Christoph Lameter wrote:
>>> On Mon, 15 Feb 2010, Nick Piggin wrote:
>>>
>>>> I'm just worried there is still an underlying problem here.
>>> So am I. What caused the breakage that requires this patchset?
>>
>> Memory hotadd with a new node being onlined.
>
> So can you post the oops, please? Right now I am looking at zapping the
I can't post the oops from a pre-release system.
> series from slab.git due to NAKs from both Christoph and Nick.
Huh? They just complained about the patch, not the whole series.
I don't understand how that could prompt you to drop the whole series.
As far as I know nobody said the patch is wrong so far, just
that they wanted to have more analysis.
-Andi
--
[email protected] -- Speaking for myself only.
Andi Kleen wrote:
> On Mon, Feb 22, 2010 at 12:53:27PM +0200, Pekka Enberg wrote:
>> Andi Kleen wrote:
>>> On Fri, Feb 19, 2010 at 12:22:58PM -0600, Christoph Lameter wrote:
>>>> On Mon, 15 Feb 2010, Nick Piggin wrote:
>>>>
>>>>> I'm just worried there is still an underlying problem here.
>>>> So am I. What caused the breakage that requires this patchset?
>>> Memory hotadd with a new node being onlined.
>> So can you post the oops, please? Right now I am looking at zapping the
>
> I can't post the oops from a pre-release system.
>
>> series from slab.git due to NAKs from both Christoph and Nick.
>
> Huh? They just complained about the patch, not the whole series.
> I don't understand how that could prompt you to drop the whole series.
Yeah, I meant the non-ACK'd patches. Sorry for the confusion.
On Mon, Feb 22, 2010 at 06:11:03PM +0200, Pekka Enberg wrote:
> Andi Kleen wrote:
>> On Mon, Feb 22, 2010 at 12:53:27PM +0200, Pekka Enberg wrote:
>>> Andi Kleen wrote:
>>>> On Fri, Feb 19, 2010 at 12:22:58PM -0600, Christoph Lameter wrote:
>>>>> On Mon, 15 Feb 2010, Nick Piggin wrote:
>>>>>
>>>>>> I'm just worried there is still an underlying problem here.
>>>>> So am I. What caused the breakage that requires this patchset?
>>>> Memory hotadd with a new node being onlined.
>>> So can you post the oops, please? Right now I am looking at zapping the
>>
>> I can't post the oops from a pre-release system.
>>
>>> series from slab.git due to NAKs from both Christoph and Nick.
>>
>> Huh? They just complained about the patch, not the whole series.
>> I don't understand how that could prompt you to drop the whole series.
>
> Yeah, I meant the non-ACK'd patches. Sorry for the confusion.
OK, it's fine for me to drop that patch for now. I'll try to reproduce
that oops, and if I can't, then it might just not be needed.
-Andi
--
[email protected] -- Speaking for myself only.
On Sat, 20 Feb 2010, Andi Kleen wrote:
> On Fri, Feb 19, 2010 at 12:22:58PM -0600, Christoph Lameter wrote:
> > On Mon, 15 Feb 2010, Nick Piggin wrote:
> >
> > > I'm just worried there is still an underlying problem here.
> >
> > So am I. What caused the breakage that requires this patchset?
>
> Memory hotadd with a new node being onlined.
That used to work fine.
Christoph Lameter wrote:
> On Sat, 20 Feb 2010, Andi Kleen wrote:
>
>> On Fri, Feb 19, 2010 at 12:22:58PM -0600, Christoph Lameter wrote:
>>> On Mon, 15 Feb 2010, Nick Piggin wrote:
>>>
>>>> I'm just worried there is still an underlying problem here.
>>> So am I. What caused the breakage that requires this patchset?
>> Memory hotadd with a new node being onlined.
>
> That used to work fine.
OK, can we get this issue resolved? The merge window is open and
Christoph seems to be unhappy with the whole patch queue. I'd hate this
bug fix to miss .34...
Pekka
On Thu, 25 Feb 2010, Pekka Enberg wrote:
> > > > > I'm just worried there is still an underlying problem here.
> > > > So am I. What caused the breakage that requires this patchset?
> > > Memory hotadd with a new node being onlined.
> >
> > That used to work fine.
>
> OK, can we get this issue resolved? The merge window is open and Christoph
> seems to be unhappy with the whole patch queue. I'd hate this bug fix to miss
> .34...
>
I don't see how memory hotadd with a new node being onlined could have
worked fine before since slab lacked any memory hotplug notifier until
Andi just added it.
That said, I think the first and fourth patch in this series may be
unnecessary if slab's notifier were to call slab_node_prepare() on
MEM_GOING_ONLINE instead of MEM_ONLINE. Otherwise, kswapd is already
running, the zonelists for the new pgdat have been initialized, and the
bit has been set in node_states[N_HIGH_MEMORY] without
cachep->nodelists[node] having been allocated.
On Thu, 25 Feb 2010, David Rientjes wrote:
> I don't see how memory hotadd with a new node being onlined could have
> worked fine before since slab lacked any memory hotplug notifier until
> Andi just added it.
AFAICR the cpu notifier took on that role in the past.
If what you say is true then memory hotplug has never worked before.
Kamesan?
On Thu, 25 Feb 2010, Pekka Enberg wrote:
> OK, can we get this issue resolved? The merge window is open and Christoph
> seems to be unhappy with the whole patch queue. I'd hate this bug fix to miss
> .34...
Merge window? These are bugs that have to be fixed independently from a
merge window. The question is if this is the right approach or if there is
other stuff still lurking because we are not yet seeing the full picture.
Can we get some of the hotplug authors involved in the discussion?
Hi Christoph,
Christoph Lameter wrote:
>> OK, can we get this issue resolved? The merge window is open and Christoph
>> seems to be unhappy with the whole patch queue. I'd hate this bug fix to miss
>> .34...
>
> Merge window? These are bugs that have to be fixed independently from a
> merge window. The question is if this is the right approach or if there is
> other stuff still lurking because we are not yet seeing the full picture.
The first set of patches from Andi is almost one month old. If this
issue progresses as swiftly as it has to this day, I foresee a rocky
road for any of them getting merged to .34 through slab.git, that's all.
Pekka
On Thu, 25 Feb 2010, Pekka Enberg wrote:
> The first set of patches from Andi is almost one month old. If this issue
> progresses as swiftly as it has to this day, I foresee a rocky road for any of
> them getting merged to .34 through slab.git, that's all.
Onlining and offlining memory is not that frequently used.
On Thu, 25 Feb 2010, Christoph Lameter wrote:
> > I don't see how memory hotadd with a new node being onlined could have
> > worked fine before since slab lacked any memory hotplug notifier until
> > Andi just added it.
>
> AFAICR The cpu notifier took on that role in the past.
>
The cpu notifier isn't involved if the firmware notifies the kernel that a
new ACPI memory device has been added or you write a start address to
/sys/devices/system/memory/probe. Hot-added memory devices can include
ACPI_SRAT_MEM_HOT_PLUGGABLE entries in the SRAT for x86 that assign them
non-online node ids (although all such entries get their bits set in
node_possible_map at boot), so a new pgdat may be allocated for the node's
registered range.
Slab isn't concerned about that until the memory is onlined by doing
echo online > /sys/devices/system/memory/memoryX/state for the new memory
section. This is where all the new pages are onlined, kswapd is started
on the new node, and the zonelists are built. It's also where the new
node gets set in N_HIGH_MEMORY and, thus, it's possible to call
kmalloc_node() in generic kernel code. All that is done under
MEM_GOING_ONLINE and not MEM_ONLINE, which is why I suggest the first and
fourth patch in this series may not be necessary if we prevent setting the
bit in the nodemask or building the zonelists until the slab nodelists are
ready.
On Thu, 25 Feb 2010, David Rientjes wrote:
> On Thu, 25 Feb 2010, Christoph Lameter wrote:
>
> > > I don't see how memory hotadd with a new node being onlined could have
> > > worked fine before since slab lacked any memory hotplug notifier until
> > > Andi just added it.
> >
> > AFAICR The cpu notifier took on that role in the past.
> >
>
> The cpu notifier isn't involved if the firmware notifies the kernel that a
> new ACPI memory device has been added or you write a start address to
> /sys/devices/system/memory/probe. Hot-added memory devices can include
> ACPI_SRAT_MEM_HOT_PLUGGABLE entries in the SRAT for x86 that assign them
> non-online node ids (although all such entries get their bits set in
> node_possible_map at boot), so a new pgdat may be allocated for the node's
> registered range.
Yes, Andi's work makes it explicit, but there is already code in the cpu
notifier (see cpuup_prepare) that seems to have been intended to
initialize the node structures. I wonder why the hotplug people never
addressed that issue? Kame?

	list_for_each_entry(cachep, &cache_chain, next) {
		/*
		 * Set up the size64 kmemlist for cpu before we can
		 * begin anything. Make sure some other cpu on this
		 * node has not already allocated this
		 */
		if (!cachep->nodelists[node]) {
			l3 = kmalloc_node(memsize, GFP_KERNEL, node);
			if (!l3)
				goto bad;
			kmem_list3_init(l3);
			l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
			    ((unsigned long)cachep) % REAPTIMEOUT_LIST3;
			/*
			 * The l3s don't come and go as CPUs come and
			 * go. cache_chain_mutex is sufficient
			 * protection here.
			 */
			cachep->nodelists[node] = l3;
		}
		spin_lock_irq(&cachep->nodelists[node]->list_lock);
		cachep->nodelists[node]->free_limit =
			(1 + nr_cpus_node(node)) *
			cachep->batchcount + cachep->num;
		spin_unlock_irq(&cachep->nodelists[node]->list_lock);
	}
> kmalloc_node() in generic kernel code. All that is done under
> MEM_GOING_ONLINE and not MEM_ONLINE, which is why I suggest the first and
> fourth patch in this series may not be necessary if we prevent setting the
> bit in the nodemask or building the zonelists until the slab nodelists are
> ready.
That sounds good.
On Thu, 25 Feb 2010 12:30:26 -0600 (CST)
Christoph Lameter <[email protected]> wrote:
> On Thu, 25 Feb 2010, David Rientjes wrote:
>
> > I don't see how memory hotadd with a new node being onlined could have
> > worked fine before since slab lacked any memory hotplug notifier until
> > Andi just added it.
>
> AFAICR The cpu notifier took on that role in the past.
>
> If what you say is true then memory hotplug has never worked before.
> Kamesan?
>
In this code,
	int node = numa_node_id();
the node is derived from the CPU.
At node hotplug, the following order should be kept:
Add: memory -> cpu
Remove: cpu -> memory
CPUs must be onlined after memory. At least, we online CPUs only after
memory; then we (in our heavy testing on RHEL5) never see this kind of race.
I'm sorry if my answer misses your point.
Thanks,
-Kame
Christoph Lameter kirjoitti:
>> kmalloc_node() in generic kernel code. All that is done under
>> MEM_GOING_ONLINE and not MEM_ONLINE, which is why I suggest the first and
>> fourth patch in this series may not be necessary if we prevent setting the
>> bit in the nodemask or building the zonelists until the slab nodelists are
>> ready.
>
> That sounds good.
Andi?
On Thu, Feb 25, 2010 at 12:30:26PM -0600, Christoph Lameter wrote:
> On Thu, 25 Feb 2010, David Rientjes wrote:
>
> > I don't see how memory hotadd with a new node being onlined could have
> > worked fine before since slab lacked any memory hotplug notifier until
> > Andi just added it.
>
> AFAICR The cpu notifier took on that role in the past.
The problem is that slab already allocates inside the notifier,
while some of that state hadn't been set up yet.
> If what you say is true then memory hotplug has never worked before.
> Kamesan?
Memory hotplug with node add never quite worked on x86 before,
for various reasons not related to slab.
-Andi
--
[email protected] -- Speaking for myself only.
On Fri, Feb 26, 2010 at 12:45:02PM +0200, Pekka Enberg wrote:
> Christoph Lameter kirjoitti:
>>> kmalloc_node() in generic kernel code. All that is done under
>>> MEM_GOING_ONLINE and not MEM_ONLINE, which is why I suggest the first and
>>> fourth patch in this series may not be necessary if we prevent setting the
>>> bit in the nodemask or building the zonelists until the slab nodelists are
>>> ready.
>>
>> That sounds good.
>
> Andi?
Well if Christoph wants to submit a better patch that is tested and solves
the problems he can do that.
if he doesn't then I think my patch kit which has been tested
is the best alternative currently.
-Andi
On Fri, Feb 26, 2010 at 1:43 PM, Andi Kleen <[email protected]> wrote:
> On Fri, Feb 26, 2010 at 12:45:02PM +0200, Pekka Enberg wrote:
>> Christoph Lameter kirjoitti:
> >>>> kmalloc_node() in generic kernel code. All that is done under
>>>> MEM_GOING_ONLINE and not MEM_ONLINE, which is why I suggest the first and
>>>> fourth patch in this series may not be necessary if we prevent setting the
>>>> bit in the nodemask or building the zonelists until the slab nodelists are
>>>> ready.
>>>
>>> That sounds good.
>>
>> Andi?
>
> Well if Christoph wants to submit a better patch that is tested and solves
> the problems he can do that.
Sure.
> if he doesn't then I think my patch kit which has been tested
> is the best alternative currently.
So do you expect me to merge your patches over his objections?
Pekka
On Fri, Feb 26, 2010 at 02:35:24PM +0200, Pekka Enberg wrote:
> On Fri, Feb 26, 2010 at 1:43 PM, Andi Kleen <[email protected]> wrote:
> > On Fri, Feb 26, 2010 at 12:45:02PM +0200, Pekka Enberg wrote:
> >> Christoph Lameter kirjoitti:
> >>>> kmalloc_node() in generic kernel code. All that is done under
> >>>> MEM_GOING_ONLINE and not MEM_ONLINE, which is why I suggest the first and
> >>>> fourth patch in this series may not be necessary if we prevent setting the
> >>>> bit in the nodemask or building the zonelists until the slab nodelists are
> >>>> ready.
> >>>
> >>> That sounds good.
> >>
> >> Andi?
> >
> > Well if Christoph wants to submit a better patch that is tested and solves
> > the problems he can do that.
>
> Sure.
>
> > if he doesn't then I think my patch kit which has been tested
> > is the best alternative currently.
>
> So do you expect me to merge your patches over his objections?
Let's put it like this: I'm sure there are myriad different
ways across the possible design space to change slab to
make memory hotadd work.
Unless someone gives me a strong reason (e.g. code as submitted
doesn't work or is really unclean) I'm not very motivated to try them
all (also given that slab.c is really legacy code that will
hopefully go away at some point).
Also there are still other bugs to fix in memory hotadd and I'm focussing
my efforts on that.
I don't think the patches I submitted are particularly intrusive or
unclean or broken.
As far as I can see Christoph's proposal was just another way
to do this, but it wasn't clear to me it was better enough
in any way to spend significant time on it.
So yes I would prefer if you merged them as submitted just
to fix the bugs. If someone else comes up with a better way
to do this and submits patches they could still change
to that later.
As for the timer race patch: I cannot make a strong
argument right now that it's needed, on the other hand
a bit of defensive programming also doesn't hurt. But
if that one is not in I won't cry.
-Andi
On Fri, 26 Feb 2010, Andi Kleen wrote:
> Memory hotplug with node add never quite worked on x86 before,
> for various reasons not related to slab.
Ok but why did things break in such a big way?
I mean why the core changes if this is an x86 issue?
On Fri, Feb 26, 2010 at 09:04:56AM -0600, Christoph Lameter wrote:
> On Fri, 26 Feb 2010, Andi Kleen wrote:
>
> > Memory hotplug with node add never quite worked on x86 before,
> > for various reasons not related to slab.
>
> Ok but why did things break in such a big way?
1) numa memory hotadd never worked
2) the rest just bitrotted because nobody tested it.
-Andi
On Fri, Feb 26, 2010 at 09:05:48AM -0600, Christoph Lameter wrote:
>
> I mean why the core changes if this is an x86 issue?
The slab bugs are in no way related to x86, other than x86 supporting
memory hotadd & numa.
I only wrote "on x86" because I wasn't sure about the status on the other
platforms.
-Andi
On Fri, 26 Feb 2010, Andi Kleen wrote:
> > > Memory hotplug with node add never quite worked on x86 before,
> > > for various reasons not related to slab.
> >
> > Ok but why did things break in such a big way?
>
> 1) numa memory hotadd never worked
Well Kamesan indicated that this worked if a cpu became online.
> 2) the rest just bitrotted because nobody tested it.
Yep. David: Can you revise the relevant portions of the patchset and
repost it?
On Fri, Feb 26, 2010 at 11:24:50AM -0600, Christoph Lameter wrote:
> On Fri, 26 Feb 2010, Andi Kleen wrote:
>
> > > > Memory hotplug with node add never quite worked on x86 before,
> > > > for various reasons not related to slab.
> > >
> > > Ok but why did things break in such a big way?
> >
> > 1) numa memory hotadd never worked
>
> Well Kamesan indicated that this worked if a cpu became online.
I mean in the general case. There were tons of problems all over.
-Andi
On Fri, 26 Feb 2010, Christoph Lameter wrote:
> > 1) numa memory hotadd never worked
>
> Well Kamesan indicated that this worked if a cpu became online.
>
That may be true, but it doesn't address hotpluggable
ACPI_SRAT_MEM_HOT_PLUGGABLE entries for CONFIG_MEMORY_HOTPLUG_SPARSE where
no cpus are being onlined or writing to /sys/devices/system/memory/probe
for CONFIG_ARCH_MEMORY_PROBE.
> > 2) the rest just bitrotted because nobody tested it.
>
> Yep. David: Can you revise the relevant portions of the patchset and
> repost it?
>
Ok.
On Fri, 26 Feb 2010 18:31:15 +0100
Andi Kleen <[email protected]> wrote:
> On Fri, Feb 26, 2010 at 11:24:50AM -0600, Christoph Lameter wrote:
> > On Fri, 26 Feb 2010, Andi Kleen wrote:
> >
> > > > > Memory hotplug with node add never quite worked on x86 before,
> > > > > for various reasons not related to slab.
> > > >
> > > > Ok but why did things break in such a big way?
> > >
> > > 1) numa memory hotadd never worked
> >
> > Well Kamesan indicated that this worked if a cpu became online.
>
> I mean in the general case. There were tons of problems all over.
>
Then it's a cpu hotplug matter, not a memory hotplug one.
The cpu hotplug callback should prepare
	l3 = searchp->nodelists[node];
	BUG_ON(!l3);
before the cpu is onlined, rather than taking care of races.
Thanks,
-Kame