The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is valid")
as an optimized variant of alloc_pages_node() that doesn't allow the node id
to be -1. Unfortunately the name of the function can easily suggest that the
allocation is restricted to the given node and fails otherwise. In truth, the
node is only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has led to mistakes in the past, see 5265047ac301 ("mm,
thp: really limit transparent hugepage allocation to local node") and
b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
To prevent further mistakes and provide a convenience function for allocations
truly restricted to a node, this patch makes alloc_pages_exact_node() pass
__GFP_THISNODE to that effect. The previous implementation of
alloc_pages_exact_node() is copied as __alloc_pages_node(), which implies it's
an optimized variant of alloc_pages_node() not intended for general usage.
All three functions are described in the comment.
Existing callers of alloc_pages_exact_node() are adjusted as follows:
- those that explicitly pass __GFP_THISNODE keep calling
alloc_pages_exact_node(), but the flag is removed from the call
- others are converted to call __alloc_pages_node(). Some may still pass
__GFP_THISNODE if they serve as wrappers that get gfp_flags from higher
layers.
There's one exception, sba_alloc_coherent(), which open-codes the check for
nid == -1, so it is converted to use alloc_pages_node() instead. This means
it no longer performs some VM_BUG_ON checks, but otherwise the whole patch
makes no functional changes.
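To illustrate the intended usage after this patch (a sketch only, not a hunk
from the diff below; nid, flags and order stand for whatever the caller uses):

        /* caller truly requires the given node; __GFP_THISNODE is now implied */
        page = alloc_pages_exact_node(nid, flags, order);

        /* caller merely prefers the given node, which must be valid and online */
        page = __alloc_pages_node(nid, flags, order);

        /* caller may pass NUMA_NO_NODE to mean the current node */
        page = alloc_pages_node(nid, flags, order);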
Signed-off-by: Vlastimil Babka <[email protected]>
---
I've dropped non-mm guys from CC for now and marked as RFC until we agree on
the API.
arch/ia64/hp/common/sba_iommu.c | 6 +-----
arch/ia64/kernel/uncached.c | 2 +-
arch/ia64/sn/pci/pci_dma.c | 2 +-
arch/powerpc/platforms/cell/ras.c | 2 +-
arch/x86/kvm/vmx.c | 2 +-
drivers/misc/sgi-xp/xpc_uv.c | 2 +-
include/linux/gfp.h | 23 +++++++++++++++++++++++
kernel/profile.c | 8 ++++----
mm/filemap.c | 2 +-
mm/hugetlb.c | 7 +++----
mm/memory-failure.c | 2 +-
mm/mempolicy.c | 6 ++----
mm/migrate.c | 6 ++----
mm/slab.c | 2 +-
mm/slob.c | 2 +-
mm/slub.c | 2 +-
16 files changed, 45 insertions(+), 31 deletions(-)
diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 344387a..a6d6190 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1140,13 +1140,9 @@ sba_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
#ifdef CONFIG_NUMA
{
- int node = ioc->node;
struct page *page;
- if (node == NUMA_NO_NODE)
- node = numa_node_id();
-
- page = alloc_pages_exact_node(node, flags, get_order(size));
+ page = alloc_pages_node(ioc->node, flags, get_order(size));
if (unlikely(!page))
return NULL;
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 20e8a9b..b187c87 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -98,7 +98,7 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
/* attempt to allocate a granule's worth of cached memory pages */
page = alloc_pages_exact_node(nid,
- GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
+ GFP_KERNEL | __GFP_ZERO,
IA64_GRANULE_SHIFT-PAGE_SHIFT);
if (!page) {
mutex_unlock(&uc_pool->add_chunk_mutex);
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d0853e8..8f59907 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -92,7 +92,7 @@ static void *sn_dma_alloc_coherent(struct device *dev, size_t size,
*/
node = pcibus_to_node(pdev->bus);
if (likely(node >=0)) {
- struct page *p = alloc_pages_exact_node(node,
+ struct page *p = __alloc_pages_node(node,
flags, get_order(size));
if (likely(p))
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index e865d74..ff5ae13 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -124,7 +124,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
area->nid = nid;
area->order = order;
area->pages = alloc_pages_exact_node(area->nid,
- GFP_KERNEL|__GFP_THISNODE,
+ GFP_KERNEL,
area->order);
if (!area->pages) {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2d73807..8c7f3b0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
struct page *pages;
struct vmcs *vmcs;
- pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
+ pages = __alloc_pages_node(node, GFP_KERNEL, vmcs_config.order);
if (!pages)
return NULL;
vmcs = page_address(pages);
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 95c8944..a4758cd 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -240,7 +240,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
nid = cpu_to_node(cpu);
page = alloc_pages_exact_node(nid,
- GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
+ GFP_KERNEL | __GFP_ZERO,
pg_order);
if (page == NULL) {
dev_err(xpc_part, "xpc_create_gru_mq_uv() failed to alloc %d "
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 15928f0..c50848e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -300,6 +300,22 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
}
+/*
+ * An optimized version of alloc_pages_node(), to be only used in places where
+ * the overhead of the check for nid == -1 could matter.
+ */
+static inline struct page *
+__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
+{
+ VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+
+ return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+}
+
+/*
+ * Allocate pages, preferring the node given as nid. When nid equals -1,
+ * prefer the current CPU's node.
+ */
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
@@ -310,11 +326,18 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
+/*
+ * Allocate pages, restricting the allocation to the node given as nid. The
+ * node must be valid and online. This is achieved by adding __GFP_THISNODE
+ * to gfp_mask.
+ */
static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+ gfp_mask |= __GFP_THISNODE;
+
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
diff --git a/kernel/profile.c b/kernel/profile.c
index a7bcd28..30a9404 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -339,7 +339,7 @@ static int profile_cpu_callback(struct notifier_block *info,
node = cpu_to_mem(cpu);
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
- page = alloc_pages_exact_node(node,
+ page = __alloc_pages_node(node,
GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
@@ -347,7 +347,7 @@ static int profile_cpu_callback(struct notifier_block *info,
per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
- page = alloc_pages_exact_node(node,
+ page = __alloc_pages_node(node,
GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
@@ -544,14 +544,14 @@ static int create_hash_tables(void)
struct page *page;
page = alloc_pages_exact_node(node,
- GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
+ GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
goto out_cleanup;
per_cpu(cpu_profile_hits, cpu)[1]
= (struct profile_hit *)page_address(page);
page = alloc_pages_exact_node(node,
- GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
+ GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
goto out_cleanup;
diff --git a/mm/filemap.c b/mm/filemap.c
index 6bf5e42..5a7d4e2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -648,7 +648,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
do {
cpuset_mems_cookie = read_mems_allowed_begin();
n = cpuset_mem_spread_node();
- page = alloc_pages_exact_node(n, gfp, 0);
+ page = __alloc_pages_node(n, gfp, 0);
} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 271e443..156d8d7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1089,8 +1089,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
struct page *page;
page = alloc_pages_exact_node(nid,
- htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
- __GFP_REPEAT|__GFP_NOWARN,
+ htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN,
huge_page_order(h));
if (page) {
if (arch_prepare_hugepage(page)) {
@@ -1251,8 +1250,8 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
huge_page_order(h));
else
page = alloc_pages_exact_node(nid,
- htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
- __GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
+ htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|
+ __GFP_NOWARN, huge_page_order(h));
if (page && arch_prepare_hugepage(page)) {
__free_pages(page, huge_page_order(h));
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 501820c..b783bc5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1503,7 +1503,7 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
return alloc_huge_page_node(page_hstate(compound_head(p)),
nid);
else
- return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
+ return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
}
/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7477432..4547960 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -945,8 +945,7 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
else
- return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE |
- __GFP_THISNODE, 0);
+ return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0);
}
/*
@@ -1986,8 +1985,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(node, *nmask)) {
mpol_cond_put(pol);
- page = alloc_pages_exact_node(node,
- gfp | __GFP_THISNODE, order);
+ page = alloc_pages_exact_node(node, gfp, order);
goto out;
}
}
diff --git a/mm/migrate.c b/mm/migrate.c
index f53838f..d139222 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1554,10 +1554,8 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
struct page *newpage;
newpage = alloc_pages_exact_node(nid,
- (GFP_HIGHUSER_MOVABLE |
- __GFP_THISNODE | __GFP_NOMEMALLOC |
- __GFP_NORETRY | __GFP_NOWARN) &
- ~GFP_IOFS, 0);
+ (GFP_HIGHUSER_MOVABLE | __GFP_NOMEMALLOC |
+ __GFP_NORETRY | __GFP_NOWARN) & ~GFP_IOFS, 0);
return newpage;
}
diff --git a/mm/slab.c b/mm/slab.c
index 7eb38dd..5f49e63 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1594,7 +1594,7 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
if (memcg_charge_slab(cachep, flags, cachep->gfporder))
return NULL;
- page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
+ page = __alloc_pages_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
if (!page) {
memcg_uncharge_slab(cachep, cachep->gfporder);
slab_out_of_memory(cachep, flags, nodeid);
diff --git a/mm/slob.c b/mm/slob.c
index 4765f65..10d8e02 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -193,7 +193,7 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
#ifdef CONFIG_NUMA
if (node != NUMA_NO_NODE)
- page = alloc_pages_exact_node(node, gfp, order);
+ page = __alloc_pages_node(node, gfp, order);
else
#endif
page = alloc_pages(gfp, order);
diff --git a/mm/slub.c b/mm/slub.c
index 54c0876..0486343 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1323,7 +1323,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
- page = alloc_pages_exact_node(node, flags, order);
+ page = __alloc_pages_node(node, flags, order);
if (!page)
memcg_uncharge_slab(s, order);
--
2.4.6
Perform the same debug checks in alloc_pages_node() as are done in
alloc_pages_exact_node() and __alloc_pages_node() by making the latter
function the inner core of the former ones.
Change the !node_online(nid) check from VM_BUG_ON to VM_WARN_ON since it's not
fatal and this patch may expose some buggy callers.
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/gfp.h | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c50848e..54c3ee7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -307,7 +307,8 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
- VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+ VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+ VM_WARN_ON(!node_online(nid));
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
@@ -319,11 +320,11 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
- /* Unknown node is current node */
- if (nid < 0)
+ /* Unknown node is current (or closest) node */
+ if (nid == NUMA_NO_NODE)
nid = numa_node_id();
- return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+ return __alloc_pages_node(nid, gfp_mask, order);
}
/*
@@ -334,11 +335,7 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
- VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
-
- gfp_mask |= __GFP_THISNODE;
-
- return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+ return __alloc_pages_node(nid, gfp_mask | __GFP_THISNODE, order);
}
#ifdef CONFIG_NUMA
--
2.4.6
numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
so it's a more robust fallback than the currently used numa_node_id().
Suggested-by: Christoph Lameter <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/gfp.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 54c3ee7..531c72d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -322,7 +322,7 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
{
/* Unknown node is current (or closest) node */
if (nid == NUMA_NO_NODE)
- nid = numa_node_id();
+ nid = numa_mem_id();
return __alloc_pages_node(nid, gfp_mask, order);
}
--
2.4.6
alloc_pages_node() now warns when an offline node is passed. Make it fall back
to the local (or nearest) node as if NUMA_NO_NODE had been passed, but keep the
VM_WARN_ON warning.
Suggested-by: David Rientjes <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
The if (VM_WARN_ON(...)) form David suggested doesn't work that way, hence this
more involved and awkward syntax.
include/linux/gfp.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 531c72d..104a027 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -321,8 +321,12 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
/* Unknown node is current (or closest) node */
- if (nid == NUMA_NO_NODE)
+ if (nid == NUMA_NO_NODE) {
nid = numa_mem_id();
+ } else if (!node_online(nid)) {
+ VM_WARN_ON(!node_online(nid));
+ nid = numa_mem_id();
+ }
return __alloc_pages_node(nid, gfp_mask, order);
}
--
2.4.6
On Fri, 24 Jul 2015, Vlastimil Babka wrote:
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 531c72d..104a027 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -321,8 +321,12 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> unsigned int order)
> {
> /* Unknown node is current (or closest) node */
> - if (nid == NUMA_NO_NODE)
> + if (nid == NUMA_NO_NODE) {
> nid = numa_mem_id();
> + } else if (!node_online(nid)) {
> + VM_WARN_ON(!node_online(nid));
> + nid = numa_mem_id();
> + }
I would think you would only want this for debugging purposes. The
overwhelming majority of hardware out there has no memory
onlining/offlining capability after all and this adds the overhead to each
call to alloc_pages_node.
Make this depend on CONFIG_VM_DEBUG or some such thing?
On Fri, 24 Jul 2015, Christoph Lameter wrote:
> On Fri, 24 Jul 2015, Vlastimil Babka wrote:
>
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 531c72d..104a027 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -321,8 +321,12 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> > unsigned int order)
> > {
> > /* Unknown node is current (or closest) node */
> > - if (nid == NUMA_NO_NODE)
> > + if (nid == NUMA_NO_NODE) {
> > nid = numa_mem_id();
> > + } else if (!node_online(nid)) {
> > + VM_WARN_ON(!node_online(nid));
> > + nid = numa_mem_id();
> > + }
>
> I would think you would only want this for debugging purposes. The
> overwhelming majority of hardware out there has no memory
> onlining/offlining capability after all and this adds the overhead to each
> call to alloc_pages_node.
>
> Make this depend on CONFIG_VM_DEBUG or some such thing?
>
Yeah, the suggestion was for VM_WARN_ON() in the conditional, but the
placement has changed somewhat because of the new __alloc_pages_node(). I
think
else if (VM_WARN_ON(!node_online(nid)))
nid = numa_mem_id();
should be fine since it only triggers for CONFIG_DEBUG_VM.
On Fri, 24 Jul 2015, Vlastimil Babka wrote:
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 15928f0..c50848e 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -300,6 +300,22 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
> return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
> }
>
> +/*
> + * An optimized version of alloc_pages_node(), to be only used in places where
> + * the overhead of the check for nid == -1 could matter.
We don't actually check for nid == -1, or nid == NUMA_NO_NODE, in any of
the functions. I would just state that nid must be valid and possible to
allocate from when passed to this function.
> + */
> +static inline struct page *
> +__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> +{
> + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
> +
> + return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> +}
> +
> +/*
> + * Allocate pages, preferring the node given as nid. When nid equals -1,
> + * prefer the current CPU's node.
> + */
We've done quite a bit of work to refer only to NUMA_NO_NODE, so we'd like
to avoid hardcoded -1 anywhere we can.
> static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> unsigned int order)
> {
> @@ -310,11 +326,18 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> }
>
> +/*
> + * Allocate pages, restricting the allocation to the node given as nid. The
> + * node must be valid and online. This is achieved by adding __GFP_THISNODE
> + * to gfp_mask.
Not sure we need to point out that __GFP_THISNODE does this; it stands out
pretty well in the function already :)
> + */
> static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
> unsigned int order)
> {
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>
> + gfp_mask |= __GFP_THISNODE;
> +
> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> }
>
[snip]
I assume you looked at the collapse_huge_page() case and decided that it
needs no modification since the gfp mask is used later for other calls?
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f53838f..d139222 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1554,10 +1554,8 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
> struct page *newpage;
>
> newpage = alloc_pages_exact_node(nid,
> - (GFP_HIGHUSER_MOVABLE |
> - __GFP_THISNODE | __GFP_NOMEMALLOC |
> - __GFP_NORETRY | __GFP_NOWARN) &
> - ~GFP_IOFS, 0);
> + (GFP_HIGHUSER_MOVABLE | __GFP_NOMEMALLOC |
> + __GFP_NORETRY | __GFP_NOWARN) & ~GFP_IOFS, 0);
>
> return newpage;
> }
[snip]
What about the alloc_pages_exact_node() in new_page_node()?
On Fri, 24 Jul 2015, Vlastimil Babka wrote:
> Perform the same debug checks in alloc_pages_node() as are done in
> alloc_pages_exact_node() and __alloc_pages_node() by making the latter
> function the inner core of the former ones.
>
> Change the !node_online(nid) check from VM_BUG_ON to VM_WARN_ON since it's not
> fatal and this patch may expose some buggy callers.
>
> Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
On Fri, 24 Jul 2015, Vlastimil Babka wrote:
> numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
> so it's a more robust fallback than the currently used numa_node_id().
>
> Suggested-by: Christoph Lameter <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
On 24.7.2015 21:54, David Rientjes wrote:
> On Fri, 24 Jul 2015, Christoph Lameter wrote:
>
>> On Fri, 24 Jul 2015, Vlastimil Babka wrote:
>>
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index 531c72d..104a027 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -321,8 +321,12 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>>> unsigned int order)
>>> {
>>> /* Unknown node is current (or closest) node */
>>> - if (nid == NUMA_NO_NODE)
>>> + if (nid == NUMA_NO_NODE) {
>>> nid = numa_mem_id();
>>> + } else if (!node_online(nid)) {
>>> + VM_WARN_ON(!node_online(nid));
>>> + nid = numa_mem_id();
>>> + }
>>
>> I would think you would only want this for debugging purposes. The
>> overwhelming majority of hardware out there has no memory
>> onlining/offlining capability after all and this adds the overhead to each
>> call to alloc_pages_node.
>>
>> Make this depend on CONFIG_VM_DEBUG or some such thing?
>>
>
> Yeah, the suggestion was for VM_WARN_ON() in the conditional, but the
> placement has changed somewhat because of the new __alloc_pages_node(). I
> think
>
> else if (VM_WARN_ON(!node_online(nid)))
> nid = numa_mem_id();
>
> should be fine since it only triggers for CONFIG_DEBUG_VM.
Um, so on your original suggestion I thought that you assumed that the condition
inside VM_WARN_ON is evaluated regardless of CONFIG_DEBUG_VM, it just will or
will not generate a warning. Which is how BUG_ON works, but VM_WARN_ON (and
VM_BUG_ON) doesn't. IIUC VM_WARN_ON() with !CONFIG_DEBUG_VM will always be false.
Because I didn't think you would suggest the "nid = numa_mem_id()" for
!node_online(nid) fixup would happen only for CONFIG_DEBUG_VM kernels. But it
seems that you do suggest that? I would understand if the fixup (correcting an
offline node to one that's online) was done regardless of DEBUG_VM, and
DEBUG_VM just switched between silent and noisy fixup. But having a debug option
alter the outcome seems wrong?
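To make that concrete, here is a minimal sketch with simplified stand-ins for
the real <linux/mmdebug.h> macros (illustrative only, not their actual
definitions):

#ifdef CONFIG_DEBUG_VM
#define VM_WARN_ON(cond)        WARN_ON(cond)   /* evaluates cond, warns if true */
#else
#define VM_WARN_ON(cond)        (0)             /* cond is never evaluated */
#endif

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                unsigned int order)
{
        /* Unknown node is current (or closest) node */
        if (nid == NUMA_NO_NODE)
                nid = numa_mem_id();
        else if (VM_WARN_ON(!node_online(nid)))
                nid = numa_mem_id();    /* reached only with CONFIG_DEBUG_VM */

        return __alloc_pages_node(nid, gfp_mask, order);
}

With the stand-in above, an offline nid is silently passed through on
!CONFIG_DEBUG_VM kernels, which is exactly the outcome difference I'm
questioning.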
Am I correct that passing an offline node is not fatal, just the zonelist will
be empty and the allocation will fail? Now without DEBUG_VM it would silently
fail, and with DEBUG_VM it would warn, but succeed on another node.
So either we do fixup regardless of DEBUG_VM, or drop this patch, as the
VM_WARN_ON(!node_online(nid)) is already done in __alloc_pages_node() thanks to
patch 2/4?
On 24.7.2015 22:08, David Rientjes wrote:
> On Fri, 24 Jul 2015, Vlastimil Babka wrote:
>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 15928f0..c50848e 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -300,6 +300,22 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
>> return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
>> }
>>
>> +/*
>> + * An optimized version of alloc_pages_node(), to be only used in places where
>> + * the overhead of the check for nid == -1 could matter.
>
> We don't actually check for nid == -1, or nid == NUMA_NO_NODE, in any of
> the functions. I would just state that nid must be valid and possible to
> allocate from when passed to this function.
OK
>> + */
>> +static inline struct page *
>> +__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>> +{
>> + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>> +
>> + return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>> +}
>> +
>> +/*
>> + * Allocate pages, preferring the node given as nid. When nid equals -1,
>> + * prefer the current CPU's node.
>> + */
>
> We've done quite a bit of work to refer only to NUMA_NO_NODE, so we'd like
> to avoid hardcoded -1 anywhere we can.
OK
>> static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>> unsigned int order)
>> {
>> @@ -310,11 +326,18 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>> }
>>
>> +/*
>> + * Allocate pages, restricting the allocation to the node given as nid. The
>> + * node must be valid and online. This is achieved by adding __GFP_THISNODE
>> + * to gfp_mask.
>
> Not sure we need to point out that __GFP_THISNODE does this; it stands out
> pretty well in the function already :)
Right.
>> + */
>> static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
>> unsigned int order)
>> {
>> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>>
>> + gfp_mask |= __GFP_THISNODE;
>> +
>> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>> }
>>
> [snip]
>
> I assume you looked at the collapse_huge_page() case and decided that it
> needs no modification since the gfp mask is used later for other calls?
Yeah. Not that the memcg charge parts would seem to care about __GFP_THISNODE,
though.
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index f53838f..d139222 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1554,10 +1554,8 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
>> struct page *newpage;
>>
>> newpage = alloc_pages_exact_node(nid,
>> - (GFP_HIGHUSER_MOVABLE |
>> - __GFP_THISNODE | __GFP_NOMEMALLOC |
>> - __GFP_NORETRY | __GFP_NOWARN) &
>> - ~GFP_IOFS, 0);
>> + (GFP_HIGHUSER_MOVABLE | __GFP_NOMEMALLOC |
>> + __GFP_NORETRY | __GFP_NOWARN) & ~GFP_IOFS, 0);
>>
>> return newpage;
>> }
> [snip]
>
> What about the alloc_pages_exact_node() in new_page_node()?
Oops, seems I missed that one. So the API seems ok otherwise?
On Fri, 24 Jul 2015, Vlastimil Babka wrote:
> >>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >>> index 531c72d..104a027 100644
> >>> --- a/include/linux/gfp.h
> >>> +++ b/include/linux/gfp.h
> >>> @@ -321,8 +321,12 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> >>> unsigned int order)
> >>> {
> >>> /* Unknown node is current (or closest) node */
> >>> - if (nid == NUMA_NO_NODE)
> >>> + if (nid == NUMA_NO_NODE) {
> >>> nid = numa_mem_id();
> >>> + } else if (!node_online(nid)) {
> >>> + VM_WARN_ON(!node_online(nid));
> >>> + nid = numa_mem_id();
> >>> + }
> >>
> >> I would think you would only want this for debugging purposes. The
> >> overwhelming majority of hardware out there has no memory
> >> onlining/offlining capability after all and this adds the overhead to each
> >> call to alloc_pages_node.
> >>
> >> Make this depend on CONFIG_VM_DEBUG or some such thing?
> >>
> >
> > Yeah, the suggestion was for VM_WARN_ON() in the conditional, but the
> > placement has changed somewhat because of the new __alloc_pages_node(). I
> > think
> >
> > else if (VM_WARN_ON(!node_online(nid)))
> > nid = numa_mem_id();
> >
> > should be fine since it only triggers for CONFIG_DEBUG_VM.
>
> Um, so on your original suggestion I thought that you assumed that the condition
> inside VM_WARN_ON is evaluated regardless of CONFIG_DEBUG_VM, it just will or
> will not generate a warning. Which is how BUG_ON works, but VM_WARN_ON (and
> VM_BUG_ON) doesn't. IIUC VM_WARN_ON() with !CONFIG_DEBUG_VM will always be false.
Right, that's what Christoph is also suggesting. VM_WARN_ON() without
CONFIG_DEBUG_VM should permit the compiler to check the expression but not
generate any code, and we don't want to check node_online() here for every
allocation; it's only a debugging measure.
> Because I didn't think you would suggest the "nid = numa_mem_id()" for
> !node_online(nid) fixup would happen only for CONFIG_DEBUG_VM kernels. But it
> seems that you do suggest that? I would understand if the fixup (correcting an
> offline node to one that's online) was done regardless of DEBUG_VM, and
> DEBUG_VM just switched between silent and noisy fixup. But having a debug option
> alter the outcome seems wrong?
Hmm, not sure why this is surprising, I don't expect people to deploy
production kernels with CONFIG_DEBUG_VM enabled, it's far too expensive.
I was expecting they would enable it for, well... debug :)
In that case, if nid is a valid node but offline, then the nid =
numa_mem_id() fixup seems fine to allow the kernel to continue debugging.
When a node is offlined as a result of memory hotplug, the pgdat doesn't
get freed so it can be onlined later. Thus, alloc_pages_node() with an
offline node and !CONFIG_DEBUG_VM may not panic. If it does, this can
probably be removed because we're covered.
On Fri, 24 Jul 2015, Vlastimil Babka wrote:
> > I assume you looked at the collapse_huge_page() case and decided that it
> > needs no modification since the gfp mask is used later for other calls?
>
> Yeah. Not that the memcg charge parts would seem to care about __GFP_THISNODE,
> though.
>
Hmm, not sure that memcg would ever care about __GFP_THISNODE. I wonder
if it makes more sense to remove setting __GFP_THISNODE in
collapse_huge_page()? khugepaged_alloc_page() seems fine with the new
alloc_pages_exact_node() semantics.
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index f53838f..d139222 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -1554,10 +1554,8 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
> >> struct page *newpage;
> >>
> >> newpage = alloc_pages_exact_node(nid,
> >> - (GFP_HIGHUSER_MOVABLE |
> >> - __GFP_THISNODE | __GFP_NOMEMALLOC |
> >> - __GFP_NORETRY | __GFP_NOWARN) &
> >> - ~GFP_IOFS, 0);
> >> + (GFP_HIGHUSER_MOVABLE | __GFP_NOMEMALLOC |
> >> + __GFP_NORETRY | __GFP_NOWARN) & ~GFP_IOFS, 0);
> >>
> >> return newpage;
> >> }
> > [snip]
> >
> > What about the alloc_pages_exact_node() in new_page_node()?
>
> Oops, seems I missed that one. So the API seems ok otherwise?
>
Yup! And I believe that this patch doesn't cause any regression after the
new_page_node() issue is fixed.
On 07/25/2015 01:06 AM, David Rientjes wrote:
> On Fri, 24 Jul 2015, Vlastimil Babka wrote:
>
>> Because I didn't think you would suggest the "nid = numa_mem_id()" for
>> !node_online(nid) fixup would happen only for CONFIG_DEBUG_VM kernels. But it
>> seems that you do suggest that? I would understand if the fixup (correcting an
>> offline node to one that's online) was done regardless of DEBUG_VM, and
>> DEBUG_VM just switched between silent and noisy fixup. But having a debug option
>> alter the outcome seems wrong?
>
> Hmm, not sure why this is surprising, I don't expect people to deploy
> production kernels with CONFIG_DEBUG_VM enabled, it's far too expensive.
> I was expecting they would enable it for, well... debug :)
But is there any other place that does such a thing for debug builds?
> In that case, if nid is a valid node but offline, then the nid =
> numa_mem_id() fixup seems fine to allow the kernel to continue debugging.
>
> When a node is offlined as a result of memory hotplug, the pgdat doesn't
> get freed so it can be onlined later. Thus, alloc_pages_node() with an
> offline node and !CONFIG_DEBUG_VM may not panic. If it does, this can
> probably be removed because we're covered.
I've checked, but I can't say I understand the hotplug code completely...
it seems there are two cases:
- the node was never online, and the nid passed to alloc_pages_node() is
clearly bogus. Then there's no pgdat and it should crash on a NULL pointer
dereference. VM_WARN_ON() in __alloc_pages_node() will already catch
this and provide more details as to what caused the crash. Fixup would
allow "continue debugging", but it seems that having configured e.g. a
crashdump to inspect is a better way to debug than letting the kernel
continue?
- the node has been online in the past, so the nid pointing to an
offline node might be due to a race with offlining. It shouldn't crash,
and most likely the zonelist that is left untouched by the offlining
(AFAICS) will allow fallback to other nodes. Unless there is a nodemask
or __GFP_THISNODE, in which case the allocation fails. Again, VM_WARN_ON()
in __alloc_pages_node() will warn us already. I doubt the fixup is
needed here?
So I would just drop this patch. We already have the debug warning in
__alloc_pages_node(), and a fixup is imho just confusing.
On Fri, Jul 24, 2015 at 04:45:23PM +0200, Vlastimil Babka wrote:
> @@ -310,11 +326,18 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> }
>
> +/*
> + * Allocate pages, restricting the allocation to the node given as nid. The
> + * node must be valid and online. This is achieved by adding __GFP_THISNODE
> + * to gfp_mask.
> + */
> static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
> unsigned int order)
> {
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>
> + gfp_mask |= __GFP_THISNODE;
> +
> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> }
The "exact" name is currently ambiguous within the allocator API, and
it's bad that we have _exact_node() and _exact_nid() with entirely
different meanings. It'd be good to make "thisnode" refer to specific
and exclusive node requests, and "exact" to mean page allocation
chunks that are not in powers of two.
Would you consider renaming this function to alloc_pages_thisnode() as
part of this series?
On 07/27/2015 05:39 PM, Johannes Weiner wrote:
> On Fri, Jul 24, 2015 at 04:45:23PM +0200, Vlastimil Babka wrote:
>> @@ -310,11 +326,18 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>> }
>>
>> +/*
>> + * Allocate pages, restricting the allocation to the node given as nid. The
>> + * node must be valid and online. This is achieved by adding __GFP_THISNODE
>> + * to gfp_mask.
>> + */
>> static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
>> unsigned int order)
>> {
>> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>>
>> + gfp_mask |= __GFP_THISNODE;
>> +
>> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>> }
>
> The "exact" name is currently ambiguous within the allocator API, and
> it's bad that we have _exact_node() and _exact_nid() with entirely
> different meanings. It'd be good to make "thisnode" refer to specific
> and exclusive node requests, and "exact" to mean page allocation
> chunks that are not in powers of two.
Ugh, good point.
> Would you consider renaming this function to alloc_pages_thisnode() as
> part of this series?
Sure, let's do it properly while we're at it. Yet "thisnode" is a somewhat
misleading name, as it might imply the CPU's local node. The same applies
to __GFP_THISNODE. So maybe find a better name for both? restrict_node?
single_node?
On Fri, Jul 24, 2015 at 04:45:23PM +0200, Vlastimil Babka wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
No gold stars for that one.
> as an optimized variant of alloc_pages_node(), that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node and fails otherwise. In truth, the
> node is only preferred, unless __GFP_THISNODE is passed among the gfp flags.
>
> The misleading name has led to mistakes in the past, see 5265047ac301 ("mm,
> thp: really limit transparent hugepage allocation to local node") and
> b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
>
> To prevent further mistakes and provide a convenience function for allocations
> truly restricted to a node, this patch makes alloc_pages_exact_node() pass
> __GFP_THISNODE to that effect. The previous implementation of
The change of what we have now is a good idea. What you have is a solid
improvement in my view but I see there are a few different suggestions
in the thread. Based on that I think it makes sense to just destroy
alloc_pages_exact_node. In the future "exact" in the allocator API will
mean "exactly this number of pages". Use your __alloc_pages_node helper
and specify __GFP_THISNODE if the caller requires that specific node.
> alloc_pages_exact_node() is copied as __alloc_pages_node() which implies it's
> an optimized variant of alloc_pages_node() not intended for general usage.
> All three functions are described in the comment.
>
> Existing callers of alloc_pages_exact_node() are adjusted as follows:
> - those that explicitly pass __GFP_THISNODE keep calling
> alloc_pages_exact_node(), but the flag is removed from the call
__alloc_pages_node(__GFP_THISNODE) would be harder to get wrong in the future
> - others are converted to call __alloc_pages_node(). Some may still pass
> __GFP_THISNODE if they serve as wrappers that get gfp_flags from higher
> layers.
>
> There's one exception, sba_alloc_coherent(), which open-codes the check for
> nid == -1, so it is converted to use alloc_pages_node() instead. This means
> it no longer performs some VM_BUG_ON checks, but otherwise the whole patch
> makes no functional changes.
>
In general, checks for -1 should go away, particularly with new patches.
Use NUMA_NO_NODE.
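As a rough sketch of what a converted caller would then look like (purely
illustrative, not a hunk from this series):

        /* hard requirement: the allocation must come from nid */
        page = __alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, order);

        /* preference only, and nid may be NUMA_NO_NODE */
        page = alloc_pages_node(nid, GFP_KERNEL, order);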
--
Mel Gorman
SUSE Labs
On Fri, Jul 24, 2015 at 04:45:25PM +0200, Vlastimil Babka wrote:
> numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
> so it's a more robust fallback than the currently used numa_node_id().
>
> Suggested-by: Christoph Lameter <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Wed, 29 Jul 2015, Mel Gorman wrote:
> The change of what we have now is a good idea. What you have is a solid
> improvement in my view but I see there are a few different suggestions
> in the thread. Based on that I think it makes sense to just destroy
> alloc_pages_exact_node. In the future "exact" in the allocator API will
> mean "exactly this number of pages". Use your __alloc_pages_node helper
> and specify __GFP_THISNODE if the caller requires that specific node.
Yes please.
On Wed, Jul 29, 2015 at 02:30:43PM +0100, Mel Gorman wrote:
> The change of what we have now is a good idea. What you have is a solid
> improvement in my view but I see there are a few different suggestions
> in the thread. Based on that I think it makes sense to just destroy
> alloc_pages_exact_node. In the future "exact" in the allocator API will
> mean "exactly this number of pages". Use your __alloc_pages_node helper
> and specify __GFP_THISNODE if the caller requires that specific node.
+1