Changelog since V2
o Turned out that allocating per-cpu areas for node ids on ppc64 just
wasn't stable. This series statically declares the per-node data. This
wastes memory but it appears to work.
Currently SLQB is not allowed to be configured on PPC and S390 machines as
CPUs can belong to memoryless nodes. SLQB does not deal with this very well
and crashes reliably.
These patches partially fix the memoryless node problem for SLQB. The
machine will boot successfully but is unstable under stress, indicating
that SLQB still has serious problems when dealing with pages from remote
nodes. It is not clear whether the remote node instability is linked to
the per-cpu stability problem, so the two should be treated as separate bugs.
Patch 1 statically defines some per-node structures instead of using a fun
hack with DEFINE_PER_CPU. The per-node areas were not always being
initialised by the architecture, which led to a crash.
Patch 2 notes that on memoryless configurations, memory is always freed
remotely, but allocations always check the local lists and fall back to
the page allocator on failure. This is effectively a memory leak. The
patch records in kmem_cache_cpu which node it considers local: either
the real local node or the closest node available.
Patch 3 allows SLQB to be configured on PPC and S390 again. These patches
address most of the memoryless node issues on PPC and the expectation
is that the remaining bugs in SLQB are due to remote nodes,
per-cpu area allocation or both. The patch enables SLQB on S390
as Heiko Carstens has reported that the issues there have been
independently resolved.
I believe these are ready for merging, although it would be preferred if
Nick signed off on them. Christoph has suggested that SLQB should be
disabled for NUMA, but I feel that if it's disabled, the problem may never
be resolved. Hence I didn't patch accordingly, but Pekka or Nick may feel
differently.
include/linux/slqb_def.h | 3 ++
init/Kconfig | 1 -
mm/slqb.c | 52 ++++++++++++++++++++++++++++-----------------
3 files changed, 35 insertions(+), 21 deletions(-)
SLQB uses DEFINE_PER_CPU to define per-node areas. An implicit
assumption is made that every valid node ID will have a matching valid
CPU ID. In memoryless configurations, it is possible to have a node ID
with no CPU of the same ID. When this happens, the per-cpu areas are not
initialised and the per-node data is effectively random.
An attempt was made to force the allocation of per-cpu areas corresponding
to active node IDs. However, for reasons unknown this led to silent
lockups. Instead, this patch fixes the SLQB problem by forcing the per-node
data to be statically declared.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/slqb.c | 16 ++++++++--------
1 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/slqb.c b/mm/slqb.c
index 4ca85e2..4d72be2 100644
--- a/mm/slqb.c
+++ b/mm/slqb.c
@@ -1944,16 +1944,16 @@ static void init_kmem_cache_node(struct kmem_cache *s,
static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cache_cpus);
#endif
#ifdef CONFIG_NUMA
-/* XXX: really need a DEFINE_PER_NODE for per-node data, but this is better than
- * a static array */
-static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cache_nodes);
+/* XXX: really need a DEFINE_PER_NODE for per-node data because a static
+ * array is wasteful */
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
#endif
#ifdef CONFIG_SMP
static struct kmem_cache kmem_cpu_cache;
static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cpu_cpus);
#ifdef CONFIG_NUMA
-static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cpu_nodes); /* XXX per-nid */
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES]; /* XXX per-nid */
#endif
#endif
@@ -1962,7 +1962,7 @@ static struct kmem_cache kmem_node_cache;
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_node_cpus);
#endif
-static DEFINE_PER_CPU(struct kmem_cache_node, kmem_node_nodes); /*XXX per-nid */
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES]; /*XXX per-nid */
#endif
#ifdef CONFIG_SMP
@@ -2918,15 +2918,15 @@ void __init kmem_cache_init(void)
for_each_node_state(i, N_NORMAL_MEMORY) {
struct kmem_cache_node *n;
- n = &per_cpu(kmem_cache_nodes, i);
+ n = &kmem_cache_nodes[i];
init_kmem_cache_node(&kmem_cache_cache, n);
kmem_cache_cache.node_slab[i] = n;
#ifdef CONFIG_SMP
- n = &per_cpu(kmem_cpu_nodes, i);
+ n = &kmem_cpu_nodes[i];
init_kmem_cache_node(&kmem_cpu_cache, n);
kmem_cpu_cache.node_slab[i] = n;
#endif
- n = &per_cpu(kmem_node_nodes, i);
+ n = &kmem_node_nodes[i];
init_kmem_cache_node(&kmem_node_cache, n);
kmem_node_cache.node_slab[i] = n;
}
--
1.6.3.3
When freeing a page, SLQB checks if the page belongs to the local node.
If it is not, it is considered a remote free. On the allocation side, it
always checks the local lists and if they are empty, the page allocator
is called. On memoryless configurations, this is effectively a memory
leak and the machine quickly kills itself in an OOM storm.
This patch records what node ID is considered local to a CPU. As the
management structure for the CPU is always allocated from the closest
node, the node the CPU structure resides on is considered "local".
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/slqb_def.h | 3 +++
mm/slqb.c | 23 +++++++++++++++++------
2 files changed, 20 insertions(+), 6 deletions(-)
diff --git a/include/linux/slqb_def.h b/include/linux/slqb_def.h
index 1243dda..2ccbe7e 100644
--- a/include/linux/slqb_def.h
+++ b/include/linux/slqb_def.h
@@ -101,6 +101,9 @@ struct kmem_cache_cpu {
struct kmem_cache_list list; /* List for node-local slabs */
unsigned int colour_next; /* Next colour offset to use */
+ /* local_nid will be numa_node_id() except when memoryless */
+ unsigned int local_nid;
+
#ifdef CONFIG_SMP
/*
* rlist is a list of objects that don't fit on list.freelist (ie.
diff --git a/mm/slqb.c b/mm/slqb.c
index 4d72be2..89fd8e4 100644
--- a/mm/slqb.c
+++ b/mm/slqb.c
@@ -1375,7 +1375,7 @@ static noinline void *__slab_alloc_page(struct kmem_cache *s,
if (unlikely(!page))
return page;
- if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+ if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == c->local_nid)) {
struct kmem_cache_cpu *c;
int cpu = smp_processor_id();
@@ -1501,15 +1501,16 @@ static __always_inline void *__slab_alloc(struct kmem_cache *s,
struct kmem_cache_cpu *c;
struct kmem_cache_list *l;
+ c = get_cpu_slab(s, smp_processor_id());
+ VM_BUG_ON(!c);
+
#ifdef CONFIG_NUMA
- if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
+ if (unlikely(node != -1) && unlikely(node != c->local_nid)) {
try_remote:
return __remote_slab_alloc(s, gfpflags, node);
}
#endif
- c = get_cpu_slab(s, smp_processor_id());
- VM_BUG_ON(!c);
l = &c->list;
object = __cache_list_get_object(s, l);
if (unlikely(!object)) {
@@ -1518,7 +1519,7 @@ try_remote:
object = __slab_alloc_page(s, gfpflags, node);
#ifdef CONFIG_NUMA
if (unlikely(!object)) {
- node = numa_node_id();
+ node = c->local_nid;
goto try_remote;
}
#endif
@@ -1733,7 +1734,7 @@ static __always_inline void __slab_free(struct kmem_cache *s,
slqb_stat_inc(l, FREE);
if (!NUMA_BUILD || !slab_numa(s) ||
- likely(slqb_page_to_nid(page) == numa_node_id())) {
+ likely(slqb_page_to_nid(page) == c->local_nid)) {
/*
* Freeing fastpath. Collects all local-node objects, not
* just those allocated from our per-CPU list. This allows
@@ -1928,6 +1929,16 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
c->rlist.tail = NULL;
c->remote_cache_list = NULL;
#endif
+
+ /*
+ * Determine what the local node to this CPU is. Ordinarily
+ * this would be cpu_to_node() but for memoryless nodes, that
+ * is not the best value. Instead, we take the numa node that
+ * kmem_cache_cpu is allocated from as being the best guess
+ * as being local because it'll match what the page allocator
+ * thinks is the most local
+ */
+ c->local_nid = page_to_nid(virt_to_page((unsigned long)c & PAGE_MASK));
}
#ifdef CONFIG_NUMA
--
1.6.3.3
SLQB was disabled on PPC as it would stab itself in the face when running
on machines with CPUs on memoryless nodes, and was disabled on S390 due
to other functional difficulties. S390 has since been independently fixed
and PPC should work in most configurations, although remote node handling
still has some difficulties. Allow SLQB to be configured again so the
dodgy configurations can be further identified and debugged.
Signed-off-by: Mel Gorman <[email protected]>
---
init/Kconfig | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/init/Kconfig b/init/Kconfig
index adc10ab..c56248f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1033,7 +1033,6 @@ config SLUB
config SLQB
bool "SLQB (Queued allocator)"
- depends on !PPC && !S390
help
SLQB is a proposed new slab allocator.
--
1.6.3.3
On Tue, Sep 22, 2009 at 01:54:11PM +0100, Mel Gorman wrote:
> Changelog since V2
> o Turned out that allocating per-cpu areas for node ids on ppc64 just
> wasn't stable. This series statically declares the per-node data. This
> wastes memory but it appears to work.
>
> Currently SLQB is not allowed to be configured on PPC and S390 machines as
> CPUs can belong to memoryless nodes. SLQB does not deal with this very well
> and crashes reliably.
>
GACK. Sorry about the 1/4, 2/4, 3/4 problem. There are only three
patches in this set. I dropped the last patch, which was related to the
SLQB corruption problem, because it didn't appear to help, and didn't fix
up the numbering. Sorry.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Hi Mel,
On Tue, Sep 22, 2009 at 3:54 PM, Mel Gorman <[email protected]> wrote:
> When freeing a page, SLQB checks if the page belongs to the local node.
> If it is not, it is considered a remote free. On the allocation side, it
> always checks the local lists and if they are empty, the page allocator
> is called. On memoryless configurations, this is effectively a memory
> leak and the machine quickly kills itself in an OOM storm.
>
> This patch records what node ID is considered local to a CPU. As the
> management structure for the CPU is always allocated from the closest
> node, the node the CPU structure resides on is considered "local".
>
> Signed-off-by: Mel Gorman <[email protected]>
I don't understand how the memory leak happens from the above
description (or reading the code). page_to_nid() returns some crazy
value at free time? The remote list isn't drained properly?
Pekka
On Tue, Sep 22, 2009 at 04:38:32PM +0300, Pekka Enberg wrote:
> Hi Mel,
>
> On Tue, Sep 22, 2009 at 3:54 PM, Mel Gorman <[email protected]> wrote:
> > When freeing a page, SLQB checks if the page belongs to the local node.
> > If it is not, it is considered a remote free. On the allocation side, it
> > always checks the local lists and if they are empty, the page allocator
> > is called. On memoryless configurations, this is effectively a memory
> > leak and the machine quickly kills itself in an OOM storm.
> >
> > This patch records what node ID is considered local to a CPU. As the
> > management structure for the CPU is always allocated from the closest
> > node, the node the CPU structure resides on is considered "local".
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> I don't understand how the memory leak happens from the above
> description (or reading the code). page_to_nid() returns some crazy
> value at free time?
Nope, it isn't a leak as such, the allocator knows where the memory is.
The problem is that it always frees remote but on allocation, it sees
the per-cpu list is empty and calls the page allocator again. The remote
lists just grow.
> The remote list isn't drained properly?
>
That is another way of looking at it. When the remote lists get to a
watermark, they should drain. However, it's worth pointing out that if it's
repaired in this fashion, the performance of SLQB will suffer as it'll
never reuse the local list of pages and instead always get cold pages
from the allocator.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Hi Mel,
On Tue, Sep 22, 2009 at 4:54 PM, Mel Gorman <[email protected]> wrote:
>> I don't understand how the memory leak happens from the above
>> description (or reading the code). page_to_nid() returns some crazy
>> value at free time?
>
> Nope, it isn't a leak as such, the allocator knows where the memory is.
> The problem is that it always frees remote but on allocation, it sees
> the per-cpu list is empty and calls the page allocator again. The remote
> lists just grow.
>
>> The remote list isn't drained properly?
>
> That is another way of looking at it. When the remote lists get to a
> watermark, they should drain. However, it's worth pointing out if it's
> repaired in this fashion, the performance of SLQB will suffer as it'll
> never reuse the local list of pages and instead always get cold pages
> from the allocator.
I worry about setting c->local_nid to the node of the allocated struct
kmem_cache_cpu. It seems like an arbitrary policy decision that's not
necessarily the best option and I'm not totally convinced it's correct
when cpusets are configured. SLUB seems to do the sane thing here by
using page allocator fallback (which respects cpusets AFAICT) and
recycling one slab at a time.
Can I persuade you into sending me a patch that fixes remote list
draining to get things working on PPC? I'd much rather wait for Nick's
input on the allocation policy and performance.
Pekka
On Tue, Sep 22, 2009 at 3:54 PM, Mel Gorman <[email protected]> wrote:
> SLQB uses DEFINE_PER_CPU to define per-node areas. An implicit
> assumption is made that all valid node IDs will have matching valid CPU
> ids. In memoryless configurations, it is possible to have a node ID with
> no CPU having the same ID. When this happens, per-cpu areas are not
> initialised and the per-node data is effectively random.
>
> An attempt was made to force the allocation of per-cpu areas corresponding
> to active node IDs. However, for reasons unknown this led to silent
> lockups. Instead, this patch fixes the SLQB problem by forcing the per-node
> data to be statically declared.
>
> Signed-off-by: Mel Gorman <[email protected]>
Applied, thanks!
On Tue, Sep 22, 2009 at 09:54:33PM +0300, Pekka Enberg wrote:
> Hi Mel,
>
> On Tue, Sep 22, 2009 at 4:54 PM, Mel Gorman <[email protected]> wrote:
> >> I don't understand how the memory leak happens from the above
> >> description (or reading the code). page_to_nid() returns some crazy
> >> value at free time?
> >
> > Nope, it isn't a leak as such, the allocator knows where the memory is.
> > The problem is that it always frees remote but on allocation, it sees
> > the per-cpu list is empty and calls the page allocator again. The remote
> > lists just grow.
> >
> >> The remote list isn't drained properly?
> >
> > That is another way of looking at it. When the remote lists get to a
> > watermark, they should drain. However, it's worth pointing out if it's
> > repaired in this fashion, the performance of SLQB will suffer as it'll
> > never reuse the local list of pages and instead always get cold pages
> > from the allocator.
>
> I worry about setting c->local_nid to the node of the allocated struct
> kmem_cache_cpu. It seems like an arbitrary policy decision that's not
> necessarily the best option and I'm not totally convinced it's correct
> when cpusets are configured. SLUB seems to do the sane thing here by
> using page allocator fallback (which respects cpusets AFAICT) and
> recycling one slab at a time.
>
> Can I persuade you into sending me a patch that fixes remote list
> draining to get things working on PPC? I'd much rather wait for Nick's
> input on the allocation policy and performance.
>
It'll be at least next week before I can revisit this again. I'm afraid
I'm going offline from tomorrow until Tuesday.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, Sep 22, 2009 at 07:56:08PM +0100, Mel Gorman wrote:
> On Tue, Sep 22, 2009 at 09:54:33PM +0300, Pekka Enberg wrote:
> > Hi Mel,
> >
> > On Tue, Sep 22, 2009 at 4:54 PM, Mel Gorman <[email protected]> wrote:
> > >> I don't understand how the memory leak happens from the above
> > >> description (or reading the code). page_to_nid() returns some crazy
> > >> value at free time?
> > >
> > > Nope, it isn't a leak as such, the allocator knows where the memory is.
> > > The problem is that it always frees remote but on allocation, it sees
> > > the per-cpu list is empty and calls the page allocator again. The remote
> > > lists just grow.
> > >
> > >> The remote list isn't drained properly?
> > >
> > > That is another way of looking at it. When the remote lists get to a
> > > watermark, they should drain. However, it's worth pointing out if it's
> > > repaired in this fashion, the performance of SLQB will suffer as it'll
> > > never reuse the local list of pages and instead always get cold pages
> > > from the allocator.
> >
> > I worry about setting c->local_nid to the node of the allocated struct
> > kmem_cache_cpu. It seems like an arbitrary policy decision that's not
> > necessarily the best option and I'm not totally convinced it's correct
> > when cpusets are configured. SLUB seems to do the sane thing here by
> > using page allocator fallback (which respects cpusets AFAICT) and
> > recycling one slab at a time.
> >
> > Can I persuade you into sending me a patch that fixes remote list
> > draining to get things working on PPC? I'd much rather wait for Nick's
> > input on the allocation policy and performance.
> >
>
> It'll be at least next week before I can revisit this again. I'm afraid
> I'm going offline from tomorrow until Tuesday.
>
Ok, so I spent today looking at this again. The problem is not with faulty
drain logic as such. As frees always place an object on a remote list
and the allocation side is often (but not always) allocating a new page,
a significant number of objects in the free list are the only object
in a page. SLQB drains based on the number of objects on the free list,
not the number of pages. With many of the pages having only one object,
the freelists are pinning a lot more memory than expected. For example,
a drain watermark of 512 objects could be pinning 2MB of pages.
The drain logic could be extended to track not only the number of objects on
the free list but also the number of pages but I really don't think that is
desirable behaviour. I'm somewhat running out of sensible ideas for dealing
with this but here is another go anyway that might be more palatable than
tracking what a "local" node is within the slab.
This boots on 2.6.32-rc1 with the latest slqb-core git tree with
Kconfig modified to allow SLQB to be set on ppc64.
==== CUT HERE ====
SLQB: Allocate from the remote lists when the local node is memoryless and has no free objects
When SLQB is freeing an object, it checks if the object belongs to a
page within the local node. If it does not, the object is freed to a
remote list. When the remote list has too many objects, the list is
drained.
On allocation, the remote list is only used if a specific node is specified
and that node is not the local node. On memoryless nodes, there is a problem
in that the specified node will often not be the local node. The impact is
that many objects on the free list are the only object in the page. This
bloats SLQB's memory requirements and causes OOM to trigger.
This patch alters the allocation path. If the allocation from local
lists fails and the local node is memoryless, an attempt will be made to
allocate from the remote lists before going to the page allocator.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/slqb.c | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/mm/slqb.c b/mm/slqb.c
index 4d72be2..b73e7d0 100644
--- a/mm/slqb.c
+++ b/mm/slqb.c
@@ -1513,16 +1513,30 @@ try_remote:
l = &c->list;
object = __cache_list_get_object(s, l);
if (unlikely(!object)) {
- object = cache_list_get_page(s, l);
- if (unlikely(!object)) {
- object = __slab_alloc_page(s, gfpflags, node);
-#ifdef CONFIG_NUMA
+ int thisnode = numa_node_id();
+
+ /*
+ * If the local node is memoryless, try remote alloc before
+ * trying the page allocator. Otherwise, what happens is
+ * objects are always freed to remote lists but the allocation
+ * side always allocates a new page with only one object
+ * used in each page
+ */
+ if (unlikely(!node_state(thisnode, N_HIGH_MEMORY)))
+ object = __remote_slab_alloc(s, gfpflags, thisnode);
+
+ if (!object) {
+ object = cache_list_get_page(s, l);
if (unlikely(!object)) {
- node = numa_node_id();
- goto try_remote;
- }
+ object = __slab_alloc_page(s, gfpflags, node);
+#ifdef CONFIG_NUMA
+ if (unlikely(!object)) {
+ node = numa_node_id();
+ goto try_remote;
+ }
#endif
- return object;
+ return object;
+ }
}
}
if (likely(object))
On Wed, 30 Sep 2009, Mel Gorman wrote:
> Ok, so I spent today looking at this again. The problem is not with faulty
> drain logic as such. As frees always place an object on a remote list
> and the allocation side is often (but not always) allocating a new page,
> a significant number of objects in the free list are the only object
> in a page. SLQB drains based on the number of objects on the free list,
> not the number of pages. With many of the pages having only one object,
> the freelists are pinning a lot more memory than expected. For example,
> a watermark to drain of 512 could be pinning 2MB of pages.
No good. So we are allocating new pages from somewhere allocating a
single object and putting them on the freelist where we do not find them
again. This is bad caching behavior as well.
> The drain logic could be extended to track not only the number of objects on
> the free list but also the number of pages but I really don't think that is
> desirable behaviour. I'm somewhat running out of sensible ideas for dealing
> with this but here is another go anyway that might be more palatable than
> tracking what a "local" node is within the slab.
SLUB avoids that issue by having a "current" page for a processor. It
allocates from the current page until its exhausted. It can use fast path
logic both for allocations and frees regardless of the pages origin. The
node fallback is handled by the page allocator and that one is only
involved when a new slab page is needed.
SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
order for free objects of the kmem_cache and then picks up from the
nearest node. Ugly but it works. SLQB would have to do something similar
since it also has the per node object bins that SLAB has.
The local node for a memoryless node may not exist at all since there may
be multiple nodes at the same distance to the memoryless node. So at
minimum you would have to manage a set of local nodes. If you have the set
then you also would need to consider memory policies. During bootup you
would have to simulate the interleave mode in effect. After bootup you
would have to use the tasks policy.
This all points to major NUMA issues in SLQB. This is not arch specific.
SLQB cannot handle memoryless nodes at this point.
> This patch alters the allocation path. If the allocation from local
> lists fails and the local node is memoryless, an attempt will be made to
> allocate from the remote lists before going to the page allocator.
Are the allocation attempts from the remote lists governed by memory
policies? Otherwise you may create imbalances on neighboring nodes.
On Wed, Sep 30, 2009 at 11:06:04AM -0400, Christoph Lameter wrote:
> On Wed, 30 Sep 2009, Mel Gorman wrote:
>
> > Ok, so I spent today looking at this again. The problem is not with faulty
> > drain logic as such. As frees always place an object on a remote list
> > and the allocation side is often (but not always) allocating a new page,
> > a significant number of objects in the free list are the only object
> > in a page. SLQB drains based on the number of objects on the free list,
> > not the number of pages. With many of the pages having only one object,
> > the freelists are pinning a lot more memory than expected. For example,
> > a watermark to drain of 512 could be pinning 2MB of pages.
>
> No good. So we are allocating new pages from somewhere allocating a
> single object and putting them on the freelist where we do not find them
> again.
Yes
> This is bad caching behavior as well.
>
Yes, I suppose it would be, as it's not using the hottest object. The
fact that it OOM storms is a bit more important than poor caching behaviour
but hey :/
> > The drain logic could be extended to track not only the number of objects on
> > the free list but also the number of pages but I really don't think that is
> > desirable behaviour. I'm somewhat running out of sensible ideas for dealing
> > with this but here is another go anyway that might be more palatable than
> > tracking what a "local" node is within the slab.
>
> SLUB avoids that issue by having a "current" page for a processor. It
> allocates from the current page until its exhausted. It can use fast path
> logic both for allocations and frees regardless of the pages origin. The
> node fallback is handled by the page allocator and that one is only
> involved when a new slab page is needed.
>
This is essentially the "unqueued" nature of SLUB. Its objective is "I have this
page here which I'm going to use until I can't use it no more and will depend
on the page allocator to sort my stuff out". I have to read up on SLUB
more to see if it's compatible with SLQB or not though. In particular, how
does SLUB deal with frees from pages that are not the "current" page? SLQB
does not care what page the object belongs to as long as it's node-local
as the object is just shoved onto a LIFO for maximum hotness.
> SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
> order for free objects of the kmem_cache and then picks up from the
> nearest node. Ugly but it works. SLQB would have to do something similar
> since it also has the per node object bins that SLAB has.
>
In a real sense, this is what the patch ends up doing. When it fails to
get something locally but sees that the local node is memoryless, it
will check the remote node lists in zonelist order. I think that's
reasonable behaviour but I'm biased because I just want the damn machine
to boot again. What do you think? Pekka, Nick?
> The local node for a memoryless node may not exist at all since there may
> be multiple nodes at the same distance to the memoryless node. So at
> minimum you would have to manage a set of local nodes. If you have the set
> then you also would need to consider memory policies. During bootup you
> would have to simulate the interleave mode in effect. After bootup you
> would have to use the tasks policy.
>
I think SLQB's treatment of memory policies needs to be handled as a separate
problem. It's less than perfect at the moment; more on that below.
> This all points to major NUMA issues in SLQB. This is not arch specific.
> SLQB cannot handle memoryless nodes at this point.
>
> > This patch alters the allocation path. If the allocation from local
> > lists fails and the local node is memoryless, an attempt will be made to
> > allocate from the remote lists before going to the page allocator.
>
> Are the allocation attempts from the remote lists governed by memory
> policies?
It does to some extent. When selecting a node zonelist, it takes the
current memory policy into account but at a glance, it does not appear
to obey a policy that restricts the available nodes.
> Otherwise you may create imbalances on neighboring nodes.
>
I haven't thought about this aspect of things a whole lot to be honest.
It's not the problem at hand.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wed, 30 Sep 2009, Mel Gorman wrote:
> > SLUB avoids that issue by having a "current" page for a processor. It
> > allocates from the current page until its exhausted. It can use fast path
> > logic both for allocations and frees regardless of the pages origin. The
> > node fallback is handled by the page allocator and that one is only
> > involved when a new slab page is needed.
> >
>
> This is essentially the "unqueued" nature of SLUB. It's objective "I have this
> page here which I'm going to use until I can't use it no more and will depend
> on the page allocator to sort my stuff out". I have to read up on SLUB up
> more to see if it's compatible with SLQB or not though. In particular, how
> does SLUB deal with frees from pages that are not the "current" page? SLQB
> does not care what page the object belongs to as long as it's node-local
> as the object is just shoved onto a LIFO for maximum hotness.
Frees are done directly to the target slab page if they are not to the
current active slab page. No centralized locks. Concurrent frees from
processors on the same node to multiple other nodes (or different pages
on the same node) can occur.
> > SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
> > order for free objects of the kmem_cache and then picks up from the
> > nearest node. Ugly but it works. SLQB would have to do something similar
> > since it also has the per node object bins that SLAB has.
> >
>
> In a real sense, this is what the patch ends up doing. When it fails to
> get something locally but sees that the local node is memoryless, it
> will check the remote node lists in zonelist order. I think that's
> reasonable behaviour but I'm biased because I just want the damn machine
> to boot again. What do you think? Pekka, Nick?
Look at fallback_alloc() in slab. You can likely copy much of it. It
considers memory policies and cpuset constraints.
On Wed, Sep 30, 2009 at 07:45:22PM -0400, Christoph Lameter wrote:
> On Wed, 30 Sep 2009, Mel Gorman wrote:
>
> > > SLUB avoids that issue by having a "current" page for a processor. It
> > > allocates from the current page until its exhausted. It can use fast path
> > > logic both for allocations and frees regardless of the pages origin. The
> > > node fallback is handled by the page allocator and that one is only
> > > involved when a new slab page is needed.
> > >
> >
> > This is essentially the "unqueued" nature of SLUB. It's objective "I have this
> > page here which I'm going to use until I can't use it no more and will depend
> > on the page allocator to sort my stuff out". I have to read up on SLUB up
> > more to see if it's compatible with SLQB or not though. In particular, how
> > does SLUB deal with frees from pages that are not the "current" page? SLQB
> > does not care what page the object belongs to as long as it's node-local
> > as the object is just shoved onto a LIFO for maximum hotness.
>
> Frees are done directly to the target slab page if they are not to the
> current active slab page. No centralized locks. Concurrent frees from
> processors on the same node to multiple other nodes (or different pages
> on the same node) can occur.
>
So as a total aside, SLQB has an advantage in that it always uses objects
in LIFO order and is more likely to be cache hot. SLUB has an advantage
when one CPU allocates and another one frees because it potentially
avoids a cache line bounce. Might be something worth bearing in mind
when/if a comparison happens later.
> > > SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
> > > order for free objects of the kmem_cache and then picks up from the
> > > nearest node. Ugly but it works. SLQB would have to do something similar
> > > since it also has the per node object bins that SLAB has.
> > >
> >
> > In a real sense, this is what the patch ends up doing. When it fails to
> > get something locally but sees that the local node is memoryless, it
> > will check the remote node lists in zonelist order. I think that's
> > reasonable behaviour but I'm biased because I just want the damn machine
> > to boot again. What do you think? Pekka, Nick?
>
> Look at fallback_alloc() in slab. You can likely copy much of it. It
> considers memory policies and cpuset constraints.
>
True, it looks like some of the logic should be taken from there all right. Can
the treatment of memory policies be dealt with as a separate thread though? I'd
prefer to get memoryless nodes sorted out before considering the next two
problems (per-cpu instability on ppc64 and memory policy handling in SLQB).
--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
On Thu, 1 Oct 2009, Mel Gorman wrote:
> > Frees are done directly to the target slab page if they are not to the
> > current active slab page. No centralized locks. Concurrent frees from
> > processors on the same node to multiple other nodes (or different pages
> > on the same node) can occur.
> >
>
> So as a total aside, SLQB has an advantage in that it always uses objects
> in LIFO order and is more likely to be cache hot. SLUB has an advantage
> when one CPU allocates and another one frees because it potentially
> avoids a cache line bounce. Might be something worth bearing in mind
> when/if a comparison happens later.
SLQB may use cache hot objects regardless of their locality. SLUB
always serves objects that have the same locality first (same page).
SLAB returns objects via the alien caches to the remote node.
So object allocations with SLUB will generate less TLB pressure since they
are localized. SLUB objects are immediately returned to the remote node.
SLAB/SLQB keeps them around for reallocation or queue processing.
> > Look at fallback_alloc() in slab. You can likely copy much of it. It
> > considers memory policies and cpuset constraints.
> >
> True, it looks like some of the logic should be taken from there all right. Can
> the treatment of memory policies be dealt with as a separate thread though? I'd
> prefer to get memoryless nodes sorted out before considering the next two
> problems (per-cpu instability on ppc64 and memory policy handling in SLQB).
Separate email thread? Ok.
On Thu, 1 Oct 2009, Mel Gorman wrote:
> True, it might have been improved more if SLUB knew what local hugepage it
> resided within as the kernel portion of the address space is backed by huge
> TLB entries. Note that SLQB could have an advantage here early in boot as
> the page allocator will tend to give it back pages within a single huge TLB
> entry. It loses the advantage when the system has been running for a very long
> time but it might be enough to skew benchmark results on cold-booted systems.
The page allocator serves pages aligned to huge page boundaries as far as
I can remember. You can actually use huge pages in slub if you set the max
order to 9. So a page obtained from the page allocator is always aligned
properly.
On Thu, Oct 01, 2009 at 10:32:54AM -0400, Christoph Lameter wrote:
> On Thu, 1 Oct 2009, Mel Gorman wrote:
>
> > > Frees are done directly to the target slab page if they are not to the
> > > current active slab page. No centralized locks. Concurrent frees from
> > > processors on the same node to multiple other nodes (or different pages
> > > on the same node) can occur.
> > >
> >
> > So as a total aside, SLQB has an advantage in that it always uses objects
> > in LIFO order and is more likely to be cache hot. SLUB has an advantage
> > when one CPU allocates and another one frees because it potentially
> > avoids a cache line bounce. Might be something worth bearing in mind
> > when/if a comparison happens later.
>
> SLQB may use cache hot objects regardless of their locality. SLUB
> always serves objects that have the same locality first (same page).
> SLAB returns objects via the alien caches to the remote node.
> So object allocations with SLUB will generate less TLB pressure since they
> are localized.
True, it might have been improved more if SLUB knew what local hugepage it
resided within as the kernel portion of the address space is backed by huge
TLB entries. Note that SLQB could have an advantage here early in boot as
the page allocator will tend to give it back pages within a single huge TLB
entry. It loses the advantage when the system has been running for a very long
time but it might be enough to skew benchmark results on cold-booted systems.
> SLUB objects are immediately returned to the remote node.
> SLAB/SLQB keeps them around for reallocation or queue processing.
>
> > > Look at fallback_alloc() in slab. You can likely copy much of it. It
> > > considers memory policies and cpuset constraints.
> > >
> > True, it looks like some of the logic should be taken from there all right. Can
> > the treatment of memory policies be dealt with as a separate thread though? I'd
> > prefer to get memoryless nodes sorted out before considering the next two
> > problems (per-cpu instability on ppc64 and memory policy handling in SLQB).
>
> Separate email thread? Ok.
>
Yes, but I'll be honest. It'll be at least two weeks before I can tackle
memory policy related issues in SLQB. It's not high on my list of
priorities. I'm more concerned with breakage on ppc64 and a patch that
forces it to be disabled. Minimally, I want this resolved before getting
distracted by another thread.
--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
On Thu, Oct 01, 2009 at 11:03:16AM -0400, Christoph Lameter wrote:
> On Thu, 1 Oct 2009, Mel Gorman wrote:
>
> > True, it might have been improved more if SLUB knew what local hugepage it
> > resided within as the kernel portion of the address space is backed by huge
> > TLB entries. Note that SLQB could have an advantage here early in boot as
> > the page allocator will tend to give it back pages within a single huge TLB
> > entry. It loses the advantage when the system has been running for a very long
> > time but it might be enough to skew benchmark results on cold-booted systems.
>
> The page allocator serves pages aligned to huge page boundaries as far as
> I can remember.
You're right, it does, particularly early in boot. It loses the advantage
when the system has been running a long time and memory is mostly full but
the same will apply to SLQB.
> You can actually use huge pages in slub if you set the max
> order to 9. So a page obtained from the page allocator is always aligned
> properly.
>
Fair point.
--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
On Thu, Oct 1, 2009 at 2:45 AM, Christoph Lameter
<[email protected]> wrote:
>> This is essentially the "unqueued" nature of SLUB. Its objective is "I have
>> this page here which I'm going to use until I can't use it any more and will
>> depend on the page allocator to sort my stuff out". I have to read up on SLUB
>> more to see if it's compatible with SLQB or not though. In particular, how
>> does SLUB deal with frees from pages that are not the "current" page? SLQB
>> does not care what page the object belongs to as long as it's node-local,
>> since the object is just shoved onto a LIFO for maximum hotness.
>
> Frees are done directly to the target slab page if they are not to the
> current active slab page. No centralized locks. Concurrent frees from
> processors on the same node to multiple other nodes (or different pages
> on the same node) can occur.
>
>> > SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
>> > order for free objects of the kmem_cache and then picks up from the
>> > nearest node. Ugly but it works. SLQB would have to do something similar
>> > since it also has the per node object bins that SLAB has.
>> >
>>
>> In a real sense, this is what the patch ends up doing. When it fails to
>> get something locally but sees that the local node is memoryless, it
>> will check the remote node lists in zonelist order. I think that's
>> reasonable behaviour but I'm biased because I just want the damn machine
>> to boot again. What do you think? Pekka, Nick?
>
> Look at fallback_alloc() in slab. You can likely copy much of it. It
> considers memory policies and cpuset constraints.
Sorry for the delay. I went ahead and merged Mel's patch to make
things boot on PPC. Fallback policy needs a bit more work as Christoph
says but I'd really love to have Nick's input on this.
Mel, do you have a Kconfig patch lying around somewhere to enable
SLQB on PPC and S390?
Pekka
On Sun, Oct 04, 2009 at 03:06:45PM +0300, Pekka Enberg wrote:
> On Thu, Oct 1, 2009 at 2:45 AM, Christoph Lameter
> <[email protected]> wrote:
> >> This is essentially the "unqueued" nature of SLUB. Its objective is "I have
> >> this page here which I'm going to use until I can't use it any more and will
> >> depend on the page allocator to sort my stuff out". I have to read up on SLUB
> >> more to see if it's compatible with SLQB or not though. In particular, how
> >> does SLUB deal with frees from pages that are not the "current" page? SLQB
> >> does not care what page the object belongs to as long as it's node-local,
> >> since the object is just shoved onto a LIFO for maximum hotness.
> >
> > Frees are done directly to the target slab page if they are not to the
> > current active slab page. No centralized locks. Concurrent frees from
> > processors on the same node to multiple other nodes (or different pages
> > on the same node) can occur.
> >
> >> > SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
> >> > order for free objects of the kmem_cache and then picks up from the
> >> > nearest node. Ugly but it works. SLQB would have to do something similar
> >> > since it also has the per node object bins that SLAB has.
> >> >
> >>
> >> In a real sense, this is what the patch ends up doing. When it fails to
> >> get something locally but sees that the local node is memoryless, it
> >> will check the remote node lists in zonelist order. I think that's
> >> reasonable behaviour but I'm biased because I just want the damn machine
> >> to boot again. What do you think? Pekka, Nick?
> >
> > Look at fallback_alloc() in slab. You can likely copy much of it. It
> > considers memory policies and cpuset constraints.
>
> Sorry for the delay. I went ahead and merged Mel's patch to make
> things boot on PPC. Fallback policy needs a bit more work as Christoph
> says but I'd really love to have Nick's input on this.
>
> Mel, do you have a Kconfig patch lying around somewhere to enable
> SLQB on PPC and S390?
>
It's patch 4 of this series.
--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab