The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is valid")
as an optimized variant of alloc_pages_node(), that doesn't allow the node id
to be -1. Unfortunately the name of the function can easily suggest that the
allocation is restricted to the given node. In truth, the node is only
preferred, unless __GFP_THISNODE is among the gfp flags.
The misleading name has lead to mistakes in the past, see 5265047ac301 ("mm,
thp: really limit transparent hugepage allocation to local node") and
b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
To prevent further mistakes, this patch renames the function to
alloc_pages_prefer_node() and documents it together with alloc_pages_node().
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Gleb Natapov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Cliff Whickman <[email protected]>
Cc: Robin Holt <[email protected]>
---
I hope the new name will be OK. Although it doesn't fully convey the
difference from alloc_pages_node(), renaming the latter would be a much
larger patch (some 60 callsites) and hopefully the new comments are enough to
prevent further confusion.
I'm CC'ing also maintainers of the callsites so they can verify that the
callsites that don't pass __GFP_THISNODE are really not intended to restrict
allocation to the given node. I went through them myself and each looked like
it's better off if it can successfully allocate on a fallback node rather
than fail. DavidR checked them also I think, but it's better if maintainers
can verify that. I'm not completely sure about all the usages in sl*b due to
multiple layers through which gfp flags are being passed.
arch/ia64/hp/common/sba_iommu.c | 2 +-
arch/ia64/kernel/uncached.c | 2 +-
arch/ia64/sn/pci/pci_dma.c | 2 +-
arch/powerpc/platforms/cell/ras.c | 2 +-
arch/x86/kvm/vmx.c | 2 +-
drivers/misc/sgi-xp/xpc_uv.c | 2 +-
include/linux/gfp.h | 14 ++++++++++++--
kernel/profile.c | 8 ++++----
mm/filemap.c | 2 +-
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 4 ++--
mm/memory-failure.c | 2 +-
mm/mempolicy.c | 4 ++--
mm/migrate.c | 4 ++--
mm/page_alloc.c | 2 --
mm/slab.c | 2 +-
mm/slob.c | 4 ++--
mm/slub.c | 2 +-
18 files changed, 35 insertions(+), 27 deletions(-)
diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 344387a..17762af 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1146,7 +1146,7 @@ sba_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
if (node == NUMA_NO_NODE)
node = numa_node_id();
- page = alloc_pages_exact_node(node, flags, get_order(size));
+ page = alloc_pages_prefer_node(node, flags, get_order(size));
if (unlikely(!page))
return NULL;
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 20e8a9b..1451961 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -97,7 +97,7 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
/* attempt to allocate a granule's worth of cached memory pages */
- page = alloc_pages_exact_node(nid,
+ page = alloc_pages_prefer_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
IA64_GRANULE_SHIFT-PAGE_SHIFT);
if (!page) {
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d0853e8..dcab82f 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -92,7 +92,7 @@ static void *sn_dma_alloc_coherent(struct device *dev, size_t size,
*/
node = pcibus_to_node(pdev->bus);
if (likely(node >=0)) {
- struct page *p = alloc_pages_exact_node(node,
+ struct page *p = alloc_pages_prefer_node(node,
flags, get_order(size));
if (likely(p))
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index e865d74..646a310 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -123,7 +123,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
area->nid = nid;
area->order = order;
- area->pages = alloc_pages_exact_node(area->nid,
+ area->pages = alloc_pages_prefer_node(area->nid,
GFP_KERNEL|__GFP_THISNODE,
area->order);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2d73807..a8723a8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
struct page *pages;
struct vmcs *vmcs;
- pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
+ pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order);
if (!pages)
return NULL;
vmcs = page_address(pages);
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 95c8944..3ff0a24 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -239,7 +239,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
mq->mmr_blade = uv_cpu_to_blade_id(cpu);
nid = cpu_to_node(cpu);
- page = alloc_pages_exact_node(nid,
+ page = alloc_pages_prefer_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
pg_order);
if (page == NULL) {
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 15928f0..ff892d7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -300,6 +300,10 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
}
+/*
+ * Allocate pages, preferring the node given as nid. When nid equals -1,
+ * prefer the current CPU's node.
+ */
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
@@ -310,7 +314,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
-static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
+/*
+ * Allocate pages, preferring the node given as nid. Unlike alloc_pages_node(),
+ * nid must be a valid online node, there is no fallback to current CPU's node.
+ *
+ * In order to completely restrict allocation to the preferred node, include
+ * __GFP_THISNODE in the gfp_mask.
+ */
+static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
@@ -354,7 +365,6 @@ extern unsigned long get_zeroed_page(gfp_t gfp_mask);
void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
void free_pages_exact(void *virt, size_t size);
-/* This is different from alloc_pages_exact_node !!! */
void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
#define __get_free_page(gfp_mask) \
diff --git a/kernel/profile.c b/kernel/profile.c
index a7bcd28..6ee695e 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -339,7 +339,7 @@ static int profile_cpu_callback(struct notifier_block *info,
node = cpu_to_mem(cpu);
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
- page = alloc_pages_exact_node(node,
+ page = alloc_pages_prefer_node(node,
GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
@@ -347,7 +347,7 @@ static int profile_cpu_callback(struct notifier_block *info,
per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
- page = alloc_pages_exact_node(node,
+ page = alloc_pages_prefer_node(node,
GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
@@ -543,14 +543,14 @@ static int create_hash_tables(void)
int node = cpu_to_mem(cpu);
struct page *page;
- page = alloc_pages_exact_node(node,
+ page = alloc_pages_prefer_node(node,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
0);
if (!page)
goto out_cleanup;
per_cpu(cpu_profile_hits, cpu)[1]
= (struct profile_hit *)page_address(page);
- page = alloc_pages_exact_node(node,
+ page = alloc_pages_prefer_node(node,
GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
0);
if (!page)
diff --git a/mm/filemap.c b/mm/filemap.c
index 6bf5e42..52e3b55 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -648,7 +648,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
do {
cpuset_mems_cookie = read_mems_allowed_begin();
n = cpuset_mem_spread_node();
- page = alloc_pages_exact_node(n, gfp, 0);
+ page = alloc_pages_prefer_node(n, gfp, 0);
} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));
return page;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 078832c..7130e97 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2350,7 +2350,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
*/
up_read(&mm->mmap_sem);
- *hpage = alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER);
+ *hpage = alloc_pages_prefer_node(node, gfp, HPAGE_PMD_ORDER);
if (unlikely(!*hpage)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
*hpage = ERR_PTR(-ENOMEM);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 271e443..4c6c0c1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1088,7 +1088,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;
- page = alloc_pages_exact_node(nid,
+ page = alloc_pages_prefer_node(nid,
htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
__GFP_REPEAT|__GFP_NOWARN,
huge_page_order(h));
@@ -1250,7 +1250,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
__GFP_REPEAT|__GFP_NOWARN,
huge_page_order(h));
else
- page = alloc_pages_exact_node(nid,
+ page = alloc_pages_prefer_node(nid,
htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
__GFP_REPEAT|__GFP_NOWARN, huge_page_order(h));
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 501820c..132f043 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1503,7 +1503,7 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
return alloc_huge_page_node(page_hstate(compound_head(p)),
nid);
else
- return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0);
+ return alloc_pages_prefer_node(nid, GFP_HIGHUSER_MOVABLE, 0);
}
/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7477432..2e2a876 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -945,7 +945,7 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x
return alloc_huge_page_node(page_hstate(compound_head(page)),
node);
else
- return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE |
+ return alloc_pages_prefer_node(node, GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE, 0);
}
@@ -1986,7 +1986,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(node, *nmask)) {
mpol_cond_put(pol);
- page = alloc_pages_exact_node(node,
+ page = alloc_pages_prefer_node(node,
gfp | __GFP_THISNODE, order);
goto out;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index f53838f..b51e106 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1187,7 +1187,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,
return alloc_huge_page_node(page_hstate(compound_head(p)),
pm->node);
else
- return alloc_pages_exact_node(pm->node,
+ return alloc_pages_prefer_node(pm->node,
GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}
@@ -1553,7 +1553,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
int nid = (int) data;
struct page *newpage;
- newpage = alloc_pages_exact_node(nid,
+ newpage = alloc_pages_prefer_node(nid,
(GFP_HIGHUSER_MOVABLE |
__GFP_THISNODE | __GFP_NOMEMALLOC |
__GFP_NORETRY | __GFP_NOWARN) &
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..e6eef56 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3062,8 +3062,6 @@ EXPORT_SYMBOL(alloc_pages_exact);
*
* Like alloc_pages_exact(), but try to allocate on node nid first before falling
* back.
- * Note this is not alloc_pages_exact_node() which allocates on a specific node,
- * but is not exact.
*/
void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
{
diff --git a/mm/slab.c b/mm/slab.c
index 7eb38dd..471fde8 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1594,7 +1594,7 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
if (memcg_charge_slab(cachep, flags, cachep->gfporder))
return NULL;
- page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
+ page = alloc_pages_prefer_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
if (!page) {
memcg_uncharge_slab(cachep, cachep->gfporder);
slab_out_of_memory(cachep, flags, nodeid);
diff --git a/mm/slob.c b/mm/slob.c
index 4765f65..cfda5e2 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -45,7 +45,7 @@
* NUMA support in SLOB is fairly simplistic, pushing most of the real
* logic down to the page allocator, and simply doing the node accounting
* on the upper levels. In the event that a node id is explicitly
- * provided, alloc_pages_exact_node() with the specified node id is used
+ * provided, alloc_pages_prefer_node() with the specified node id is used
* instead. The common case (or when the node id isn't explicitly provided)
* will default to the current node, as per numa_node_id().
*
@@ -193,7 +193,7 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)
#ifdef CONFIG_NUMA
if (node != NUMA_NO_NODE)
- page = alloc_pages_exact_node(node, gfp, order);
+ page = alloc_pages_prefer_node(node, gfp, order);
else
#endif
page = alloc_pages(gfp, order);
diff --git a/mm/slub.c b/mm/slub.c
index 54c0876..8b10404 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1323,7 +1323,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
- page = alloc_pages_exact_node(node, flags, order);
+ page = alloc_pages_prefer_node(node, flags, order);
if (!page)
memcg_uncharge_slab(s, order);
--
2.4.5
On Tue, 21 Jul 2015, Vlastimil Babka wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node(), that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.
Yup. I complained about this when this was introduced. Glad to see this
fixed. Initially this was alloc_pages_node() which just means that a node
is specified. The exact behavior of the allocation is determined by flags
such as GFP_THISNODE. I'd rather have that restored because otherwise we
get into weird code like the one below. And such an arrangement also
leaves the way open to add more flags in the future that may change the
allocation behavior.
> area->nid = nid;
> area->order = order;
> - area->pages = alloc_pages_exact_node(area->nid,
> + area->pages = alloc_pages_prefer_node(area->nid,
> GFP_KERNEL|__GFP_THISNODE,
> area->order);
This is not preferring a node but requiring alloction on that node.
On Tue, Jul 21, 2015 at 8:55 AM, Vlastimil Babka <[email protected]> wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node(), that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.
>
...
> Cc: Robin Holt <[email protected]>
Acked-by: Robin Holt <[email protected]>
On Tue, 21 Jul 2015, Vlastimil Babka wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node(), that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.
>
> The misleading name has lead to mistakes in the past, see 5265047ac301 ("mm,
> thp: really limit transparent hugepage allocation to local node") and
> b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
>
> To prevent further mistakes, this patch renames the function to
> alloc_pages_prefer_node() and documents it together with alloc_pages_node().
>
alloc_pages_exact_node(), as you said, connotates that the allocation will
take place on that node or will fail. So why not go beyond this patch and
actually make alloc_pages_exact_node() set __GFP_THISNODE and then call
into a new alloc_pages_prefer_node(), which would be the current
alloc_pages_exact_node() implementation, and then fix up the callers?
On Tue, 2015-07-21 at 15:55 +0200, Vlastimil Babka wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node(), that doesn't allow the node id
> to be -1. Unfortunately the name of the function can easily suggest that the
> allocation is restricted to the given node. In truth, the node is only
> preferred, unless __GFP_THISNODE is among the gfp flags.
>
> The misleading name has lead to mistakes in the past, see 5265047ac301 ("mm,
> thp: really limit transparent hugepage allocation to local node") and
> b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
>
> To prevent further mistakes, this patch renames the function to
> alloc_pages_prefer_node() and documents it together with alloc_pages_node().
>
> Signed-off-by: Vlastimil Babka <[email protected]>
> I'm CC'ing also maintainers of the callsites so they can verify that the
> callsites that don't pass __GFP_THISNODE are really not intended to restrict
> allocation to the given node. I went through them myself and each looked like
> it's better off if it can successfully allocate on a fallback node rather
> than fail. DavidR checked them also I think, but it's better if maintainers
> can verify that. I'm not completely sure about all the usages in sl*b due to
> multiple layers through which gfp flags are being passed.
> diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
> index e865d74..646a310 100644
> --- a/arch/powerpc/platforms/cell/ras.c
> +++ b/arch/powerpc/platforms/cell/ras.c
> @@ -123,7 +123,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
>
> area->nid = nid;
> area->order = order;
> - area->pages = alloc_pages_exact_node(area->nid,
> + area->pages = alloc_pages_prefer_node(area->nid,
> GFP_KERNEL|__GFP_THISNODE,
> area->order);
Yeah that looks right to me.
This code enables some firmware memory calibration so I think it really does
want to get memory on the specified node, or else fail.
Acked-by: Michael Ellerman <[email protected]>
cheers
On 21/07/2015 15:55, Vlastimil Babka wrote:
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2d73807..a8723a8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
> struct page *pages;
> struct vmcs *vmcs;
>
> - pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
> + pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order);
> if (!pages)
> return NULL;
> vmcs = page_address(pages);
Even though there's a pretty strong preference for the "right" node,
things can work if the node is the wrong one. The order is always zero
in practice, so the allocation should succeed.
Paolo
On 07/21/2015 11:31 PM, David Rientjes wrote:
> On Tue, 21 Jul 2015, Vlastimil Babka wrote:
>
>> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
>> allocator: do not check NUMA node ID when the caller knows the node is valid")
>> as an optimized variant of alloc_pages_node(), that doesn't allow the node id
>> to be -1. Unfortunately the name of the function can easily suggest that the
>> allocation is restricted to the given node. In truth, the node is only
>> preferred, unless __GFP_THISNODE is among the gfp flags.
>>
>> The misleading name has lead to mistakes in the past, see 5265047ac301 ("mm,
>> thp: really limit transparent hugepage allocation to local node") and
>> b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
>>
>> To prevent further mistakes, this patch renames the function to
>> alloc_pages_prefer_node() and documents it together with alloc_pages_node().
>>
>
> alloc_pages_exact_node(), as you said, connotates that the allocation will
> take place on that node or will fail. So why not go beyond this patch and
> actually make alloc_pages_exact_node() set __GFP_THISNODE and then call
> into a new alloc_pages_prefer_node(), which would be the current
> alloc_pages_exact_node() implementation, and then fix up the callers?
OK, but then we have alloc_pages_node(), alloc_pages_prefer_node() and
alloc_pages_exact_node(). Isn't that a bit too much? The first two
differ only in tiny bit:
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
/* Unknown node is current node */
if (nid < 0)
nid = numa_node_id();
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
So _prefer_node is just a tiny optimization over the other one. It
should be maybe called __alloc_pages_node() then? This would perhaps
discourage users outside of mm/arch code (where it may matter). The
savings of a skipped branch is likely dubious anyway... It would be also
nice if alloc_pages_node() could use __alloc_pages_node() internally, but
I'm not sure if all callers are safe wrt the
VM_BUG_ON(!node_online(nid)) part.
So when the alloc_pages_prefer_node is diminished as __alloc_pages_node
or outright removed, then maybe alloc_pages_exact_node() which adds
__GFP_THISNODE on its own, might be a useful wrapper. But I agree with
Christoph it's a duplication of the gfp_flags functionality and I don't
think there would be many users left anyway.
On Wed, 22 Jul 2015, Paolo Bonzini wrote:
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 2d73807..a8723a8 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
> > struct page *pages;
> > struct vmcs *vmcs;
> >
> > - pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
> > + pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order);
> > if (!pages)
> > return NULL;
> > vmcs = page_address(pages);
>
> Even though there's a pretty strong preference for the "right" node,
> things can work if the node is the wrong one. The order is always zero
> in practice, so the allocation should succeed.
>
You're code is fine both before and after the patch since __GFP_THISNODE
isn't set. The allocation will eventually succeed but, as you said, may
be from remote memory (and the success of allocating on node may be
influenced by the global setting of zone_reclaim_mode).
On Wed, 22 Jul 2015, Vlastimil Babka wrote:
> > alloc_pages_exact_node(), as you said, connotates that the allocation will
> > take place on that node or will fail. So why not go beyond this patch and
> > actually make alloc_pages_exact_node() set __GFP_THISNODE and then call
> > into a new alloc_pages_prefer_node(), which would be the current
> > alloc_pages_exact_node() implementation, and then fix up the callers?
>
> OK, but then we have alloc_pages_node(), alloc_pages_prefer_node() and
> alloc_pages_exact_node(). Isn't that a bit too much? The first two
> differ only in tiny bit:
>
> static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> unsigned int order)
> {
> /* Unknown node is current node */
> if (nid < 0)
> nid = numa_node_id();
>
> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> }
>
> static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask,
> unsigned int order)
> {
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>
> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> }
>
Eek, yeah, that does look bad. I'm not even sure the
if (nid < 0)
nid = numa_node_id();
is correct; I think this should be comparing to NUMA_NO_NODE rather than
all negative numbers, otherwise we silently ignore overflow and nobody
ever knows.
> So _prefer_node is just a tiny optimization over the other one. It
> should be maybe called __alloc_pages_node() then? This would perhaps
> discourage users outside of mm/arch code (where it may matter). The
> savings of a skipped branch is likely dubious anyway... It would be also
> nice if alloc_pages_node() could use __alloc_pages_node() internally, but
> I'm not sure if all callers are safe wrt the
> VM_BUG_ON(!node_online(nid)) part.
>
I'm not sure how large you want to make your patch :) In a perfect world
I would think that we wouldn't have an alloc_pages_prefer_node() at all
and everything would be converted to alloc_pages_node() which would do
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
and then alloc_pages_exact_node() would do
return alloc_pages_node(nid, gfp_mask | __GFP_THISNODE, order);
and existing alloc_pages_exact_node() callers fixed up depending on
whether they set the bit or not.
The only possible downside would be existing users of
alloc_pages_node() that are calling it with an offline node. Since it's a
VM_BUG_ON() that would catch that, I think it should be changed to a
VM_WARN_ON() and eventually fixed up because it's nonsensical.
VM_BUG_ON() here should be avoided.
Or just go with a single alloc_pages_node() and rename __GFP_THISNODE to
__GFP_EXACT_NODE which may accomplish the same thing :)
On Wed, 22 Jul 2015, David Rientjes wrote:
> Eek, yeah, that does look bad. I'm not even sure the
>
> if (nid < 0)
> nid = numa_node_id();
>
> is correct; I think this should be comparing to NUMA_NO_NODE rather than
> all negative numbers, otherwise we silently ignore overflow and nobody
> ever knows.
Comparing to NUMA_NO_NODE would be better. Also use numa_mem_id() instead
to support memoryless nodes better?
> The only possible downside would be existing users of
> alloc_pages_node() that are calling it with an offline node. Since it's a
> VM_BUG_ON() that would catch that, I think it should be changed to a
> VM_WARN_ON() and eventually fixed up because it's nonsensical.
> VM_BUG_ON() here should be avoided.
The offline node thing could be addresses by using numa_mem_id()?
On Thu, 23 Jul 2015, Christoph Lameter wrote:
> > The only possible downside would be existing users of
> > alloc_pages_node() that are calling it with an offline node. Since it's a
> > VM_BUG_ON() that would catch that, I think it should be changed to a
> > VM_WARN_ON() and eventually fixed up because it's nonsensical.
> > VM_BUG_ON() here should be avoided.
>
> The offline node thing could be addresses by using numa_mem_id()?
>
I was concerned about any callers that were passing an offline node, not
NUMA_NO_NODE, today. One of the alloc-node functions has a VM_BUG_ON()
for it, the other silently calls node_zonelist() on it.
I suppose the final alloc_pages_node() implementation could be
if (nid == NUMA_NO_NODE || VM_WARN_ON(!node_online(nid)))
nid = numa_mem_id();
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
though.
On 07/23/2015 10:27 PM, David Rientjes wrote:
> On Thu, 23 Jul 2015, Christoph Lameter wrote:
>
>>> The only possible downside would be existing users of
>>> alloc_pages_node() that are calling it with an offline node. Since it's a
>>> VM_BUG_ON() that would catch that, I think it should be changed to a
>>> VM_WARN_ON() and eventually fixed up because it's nonsensical.
>>> VM_BUG_ON() here should be avoided.
>>
>> The offline node thing could be addresses by using numa_mem_id()?
>>
>
> I was concerned about any callers that were passing an offline node, not
> NUMA_NO_NODE, today. One of the alloc-node functions has a VM_BUG_ON()
> for it, the other silently calls node_zonelist() on it.
>
> I suppose the final alloc_pages_node() implementation could be
>
> if (nid == NUMA_NO_NODE || VM_WARN_ON(!node_online(nid)))
> nid = numa_mem_id();
>
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>
> though.
I've posted v2 based on David's and Christoph's suggestions (thanks) but
to avoid spamming everyone until we agree on the final interface, it's
marked as RFC and excludes the arch people from CC:
http://marc.info/?l=linux-kernel&m=143774920902608&w=2