2005-11-11 18:57:08

by Christoph Lameter

Subject: [RFC] NUMA memory policy support for HUGE pages

Well, since we got through respecting cpusets and allocating a page nearer
to the processor so easily, let's go for the full thing. Here is a draft of
a patch that implements full NUMA policy support for huge pages on top of
the cpuset and NUMA near-allocation patches.

I am not sure that this is the right way to do it. Maybe we would be better
off putting the whole allocator into the policy layer, like alloc_page_vma()?
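
For illustration, a minimal sketch of what such a policy-layer entry point
could look like, assuming alloc_huge_page() were changed to take a zonelist
(the name alloc_huge_page_vma and that signature change are hypothetical,
not part of this patch):

	/* Sketch: allocate a huge page obeying the vma's memory policy. */
	struct page *alloc_huge_page_vma(struct vm_area_struct *vma,
					 unsigned long addr)
	{
		/* The policy lookup stays entirely inside mm/mempolicy.c. */
		struct zonelist *zl = huge_zonelist(vma, addr);

		/* hugetlb.c then only needs to dequeue from the zonelist. */
		return alloc_huge_page(zl);
	}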

I needed to add two parameters to alloc_huge_page() in order to get the
allocation right for all policy cases. This means that find_lock_huge_page()
has a plethora of parameters now. Maybe idx and the mapping could be deduced
from addr and vma?
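
If it helps, both can in principle be rederived from the vma. A rough sketch
of that derivation, following the way the 2.6.14-era hugetlb fault path
computes them (reconstructed from memory, so double-check against the tree):

	struct address_space *mapping;
	unsigned long idx;

	/* The hugetlbfs file backing the vma provides the mapping. */
	mapping = vma->vm_file->f_mapping;

	/* Page cache index, counted in units of huge pages. */
	idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));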

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-mm1.orig/mm/mempolicy.c 2005-11-10 14:33:16.000000000 -0800
+++ linux-2.6.14-mm1/mm/mempolicy.c 2005-11-11 10:47:24.000000000 -0800
@@ -1179,6 +1179,24 @@ static unsigned offset_il_node(struct me
return nid;
}

+/* Return a zonelist suitable for a huge page allocation. */
+struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol = get_vma_policy(current, vma, addr);
+
+ if (pol->policy == MPOL_INTERLEAVE) {
+ unsigned nid;
+ unsigned long off;
+
+ off = vma->vm_pgoff;
+ off += (addr - vma->vm_start) >> HPAGE_SHIFT;
+ nid = offset_il_node(pol, vma, off);
+
+ return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
+ }
+ return zonelist_policy(GFP_HIGHUSER, pol);
+}
+
/* Allocate a page in interleaved policy.
Own path because it needs to do special accounting. */
static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
Index: linux-2.6.14-mm1/mm/hugetlb.c
===================================================================
--- linux-2.6.14-mm1.orig/mm/hugetlb.c 2005-11-11 10:04:00.000000000 -0800
+++ linux-2.6.14-mm1/mm/hugetlb.c 2005-11-11 10:32:45.000000000 -0800
@@ -33,11 +33,12 @@ static void enqueue_huge_page(struct pag
free_huge_pages_node[nid]++;
}

-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct vm_area_struct *vma,
+ unsigned long address)
{
int nid = numa_node_id();
struct page *page = NULL;
- struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;
+ struct zonelist *zonelist = huge_zonelist(vma, address);
struct zone **z;

for (z = zonelist->zones; *z; z++) {
@@ -83,13 +84,13 @@ void free_huge_page(struct page *page)
spin_unlock(&hugetlb_lock);
}

-struct page *alloc_huge_page(void)
+struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr)
{
struct page *page;
int i;

spin_lock(&hugetlb_lock);
- page = dequeue_huge_page();
+ page = dequeue_huge_page(vma, addr);
if (!page) {
spin_unlock(&hugetlb_lock);
return NULL;
@@ -192,7 +193,7 @@ static unsigned long set_max_huge_pages(
spin_lock(&hugetlb_lock);
try_to_free_low(count);
while (count < nr_huge_pages) {
- struct page *page = dequeue_huge_page();
+ struct page *page = dequeue_huge_page(NULL, 0);
if (!page)
break;
update_and_free_page(page);
@@ -343,7 +344,8 @@ void unmap_hugepage_range(struct vm_area
flush_tlb_range(vma, start, end);
}

-static struct page *find_lock_huge_page(struct address_space *mapping,
+static struct page *find_lock_huge_page(struct vm_area_struct *vma,
+ unsigned long addr, struct address_space *mapping,
unsigned long idx)
{
struct page *page;
@@ -363,7 +365,7 @@ retry:

if (hugetlb_get_quota(mapping))
goto out;
- page = alloc_huge_page();
+ page = alloc_huge_page(vma, addr);
if (!page) {
hugetlb_put_quota(mapping);
goto out;
@@ -403,7 +405,7 @@ int hugetlb_fault(struct mm_struct *mm,
* Use page lock to guard against racing truncation
* before we get page_table_lock.
*/
- page = find_lock_huge_page(mapping, idx);
+ page = find_lock_huge_page(vma, address, mapping, idx);
if (!page)
goto out;

Index: linux-2.6.14-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-mm1.orig/include/linux/mempolicy.h 2005-11-10 13:32:00.000000000 -0800
+++ linux-2.6.14-mm1/include/linux/mempolicy.h 2005-11-11 10:29:00.000000000 -0800
@@ -159,6 +159,8 @@ extern void numa_policy_init(void);
extern void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new);
extern struct mempolicy default_policy;
extern unsigned next_slab_node(struct mempolicy *policy);
+extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
+ unsigned long addr);

int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
Index: linux-2.6.14-mm1/include/linux/hugetlb.h
===================================================================
--- linux-2.6.14-mm1.orig/include/linux/hugetlb.h 2005-11-09 10:47:09.000000000 -0800
+++ linux-2.6.14-mm1/include/linux/hugetlb.h 2005-11-11 10:45:57.000000000 -0800
@@ -22,7 +22,7 @@ int hugetlb_report_meminfo(char *);
int hugetlb_report_node_meminfo(int, char *);
int is_hugepage_mem_enough(size_t);
unsigned long hugetlb_total_pages(void);
-struct page *alloc_huge_page(void);
+struct page *alloc_huge_page(struct vm_area_struct *, unsigned long);
void free_huge_page(struct page *);
int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access);
@@ -97,7 +97,7 @@ static inline unsigned long hugetlb_tota
#define is_hugepage_only_range(mm, addr, len) 0
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) \
do { } while (0)
-#define alloc_huge_page() ({ NULL; })
+#define alloc_huge_page(vma, addr) ({ NULL; })
#define free_huge_page(p) ({ (void)(p); BUG(); })
#define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; })


2005-11-11 20:29:44

by Christoph Lameter

Subject: Re: [RFC] NUMA memory policy support for HUGE pages

I just saw that mm2 is out. This is the same patch against mm2 with
hugetlb COW support.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-mm2/mm/mempolicy.c
===================================================================
--- linux-2.6.14-mm2.orig/mm/mempolicy.c 2005-11-11 12:10:19.000000000 -0800
+++ linux-2.6.14-mm2/mm/mempolicy.c 2005-11-11 12:11:01.000000000 -0800
@@ -1179,6 +1179,24 @@ static unsigned offset_il_node(struct me
return nid;
}

+/* Return a zonelist suitable for a huge page allocation. */
+struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol = get_vma_policy(current, vma, addr);
+
+ if (pol->policy == MPOL_INTERLEAVE) {
+ unsigned nid;
+ unsigned long off;
+
+ off = vma->vm_pgoff;
+ off += (addr - vma->vm_start) >> HPAGE_SHIFT;
+ nid = offset_il_node(pol, vma, off);
+
+ return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
+ }
+ return zonelist_policy(GFP_HIGHUSER, pol);
+}
+
/* Allocate a page in interleaved policy.
Own path because it needs to do special accounting. */
static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
Index: linux-2.6.14-mm2/mm/hugetlb.c
===================================================================
--- linux-2.6.14-mm2.orig/mm/hugetlb.c 2005-11-11 12:10:48.000000000 -0800
+++ linux-2.6.14-mm2/mm/hugetlb.c 2005-11-11 12:23:14.000000000 -0800
@@ -33,11 +33,12 @@ static void enqueue_huge_page(struct pag
free_huge_pages_node[nid]++;
}

-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct vm_area_struct *vma,
+ unsigned long address)
{
int nid = numa_node_id();
struct page *page = NULL;
- struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;
+ struct zonelist *zonelist = huge_zonelist(vma, address);
struct zone **z;

for (z = zonelist->zones; *z; z++) {
@@ -83,13 +84,13 @@ void free_huge_page(struct page *page)
spin_unlock(&hugetlb_lock);
}

-struct page *alloc_huge_page(void)
+struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr)
{
struct page *page;
int i;

spin_lock(&hugetlb_lock);
- page = dequeue_huge_page();
+ page = dequeue_huge_page(vma, addr);
if (!page) {
spin_unlock(&hugetlb_lock);
return NULL;
@@ -192,7 +193,7 @@ static unsigned long set_max_huge_pages(
spin_lock(&hugetlb_lock);
try_to_free_low(count);
while (count < nr_huge_pages) {
- struct page *page = dequeue_huge_page();
+ struct page *page = dequeue_huge_page(NULL, 0);
if (!page)
break;
update_and_free_page(page);
@@ -361,8 +362,8 @@ void unmap_hugepage_range(struct vm_area
flush_tlb_range(vma, start, end);
}

-static struct page *find_or_alloc_huge_page(struct address_space *mapping,
- unsigned long idx, int shared)
+static struct page *find_or_alloc_huge_page(struct vm_area_struct *vma, unsigned long addr,
+ struct address_space *mapping, unsigned long idx)
{
struct page *page;
int err;
@@ -374,13 +375,13 @@ retry:

if (hugetlb_get_quota(mapping))
goto out;
- page = alloc_huge_page();
+ page = alloc_huge_page(vma, addr);
if (!page) {
hugetlb_put_quota(mapping);
goto out;
}

- if (shared) {
+ if (vma->vm_flags & VM_SHARED) {
err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
if (err) {
put_page(page);
@@ -414,7 +415,7 @@ static int hugetlb_cow(struct mm_struct
}

page_cache_get(old_page);
- new_page = alloc_huge_page();
+ new_page = alloc_huge_page(vma, address);

if (!new_page) {
page_cache_release(old_page);
@@ -463,8 +464,7 @@ int hugetlb_no_page(struct mm_struct *mm
* Use page lock to guard against racing truncation
* before we get page_table_lock.
*/
- page = find_or_alloc_huge_page(mapping, idx,
- vma->vm_flags & VM_SHARED);
+ page = find_or_alloc_huge_page(vma, address, mapping, idx);
if (!page)
goto out;

Index: linux-2.6.14-mm2/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-mm2.orig/include/linux/mempolicy.h 2005-11-11 12:08:24.000000000 -0800
+++ linux-2.6.14-mm2/include/linux/mempolicy.h 2005-11-11 12:11:01.000000000 -0800
@@ -159,6 +159,8 @@ extern void numa_policy_init(void);
extern void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new);
extern struct mempolicy default_policy;
extern unsigned next_slab_node(struct mempolicy *policy);
+extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
+ unsigned long addr);

int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
Index: linux-2.6.14-mm2/include/linux/hugetlb.h
===================================================================
--- linux-2.6.14-mm2.orig/include/linux/hugetlb.h 2005-11-11 12:04:14.000000000 -0800
+++ linux-2.6.14-mm2/include/linux/hugetlb.h 2005-11-11 12:11:01.000000000 -0800
@@ -22,7 +22,7 @@ int hugetlb_report_meminfo(char *);
int hugetlb_report_node_meminfo(int, char *);
int is_hugepage_mem_enough(size_t);
unsigned long hugetlb_total_pages(void);
-struct page *alloc_huge_page(void);
+struct page *alloc_huge_page(struct vm_area_struct *, unsigned long);
void free_huge_page(struct page *);
int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access);
@@ -97,7 +97,7 @@ static inline unsigned long hugetlb_tota
#define is_hugepage_only_range(mm, addr, len) 0
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) \
do { } while (0)
-#define alloc_huge_page() ({ NULL; })
+#define alloc_huge_page(vma, addr) ({ NULL; })
#define free_huge_page(p) ({ (void)(p); BUG(); })
#define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; })


2005-11-11 21:31:38

by William Lee Irwin III

Subject: Re: [RFC] NUMA memory policy support for HUGE pages

On Fri, Nov 11, 2005 at 10:56:50AM -0800, Christoph Lameter wrote:
> Well, since we got through respecting cpusets and allocating a page nearer
> to the processor so easily, let's go for the full thing. Here is a draft of
> a patch that implements full NUMA policy support for huge pages on top of
> the cpuset and NUMA near-allocation patches.
> I am not sure that this is the right way to do it. Maybe we would be better
> off putting the whole allocator into the policy layer, like alloc_page_vma()?
> I needed to add two parameters to alloc_huge_page() in order to get the
> allocation right for all policy cases. This means that find_lock_huge_page()
> has a plethora of parameters now. Maybe idx and the mapping could be deduced
> from addr and vma?

I've been awash in good hugetlb patches lately, and here's another one.
I don't have any strong feelings about this (apart from the code quality
observation), so could someone who has an interest in mempolicy affairs
(Andi, Adam, et al) chime in and say this is the way people want to go?


-- wli

2005-11-14 15:08:09

by Adam Litke

Subject: Re: [RFC] NUMA memory policy support for HUGE pages

On Fri, 2005-11-11 at 12:28 -0800, Christoph Lameter wrote:
> I just saw that mm2 is out. This is the same patch against mm2 with
> hugetlb COW support.

This all seems reasonable to me. Were you planning to send out a
separate patch to support MPOL_BIND?

> Signed-off-by: Christoph Lameter <[email protected]>

Acked-By: Adam Litke <[email protected]>

> Index: linux-2.6.14-mm2/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.14-mm2.orig/mm/mempolicy.c 2005-11-11 12:10:19.000000000 -0800
> +++ linux-2.6.14-mm2/mm/mempolicy.c 2005-11-11 12:11:01.000000000 -0800
> @@ -1179,6 +1179,24 @@ static unsigned offset_il_node(struct me
> return nid;
> }
>
> +/* Return a zonelist suitable for a huge page allocation. */
> +struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
> +{
> + struct mempolicy *pol = get_vma_policy(current, vma, addr);
> +
> + if (pol->policy == MPOL_INTERLEAVE) {
> + unsigned nid;
> + unsigned long off;
> +
> + off = vma->vm_pgoff;
> + off += (addr - vma->vm_start) >> HPAGE_SHIFT;
> + nid = offset_il_node(pol, vma, off);
> +
> + return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
> + }
> + return zonelist_policy(GFP_HIGHUSER, pol);
> +}
> +
> /* Allocate a page in interleaved policy.
> Own path because it needs to do special accounting. */
> static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> Index: linux-2.6.14-mm2/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.14-mm2.orig/mm/hugetlb.c 2005-11-11 12:10:48.000000000 -0800
> +++ linux-2.6.14-mm2/mm/hugetlb.c 2005-11-11 12:23:14.000000000 -0800
> @@ -33,11 +33,12 @@ static void enqueue_huge_page(struct pag
> free_huge_pages_node[nid]++;
> }
>
> -static struct page *dequeue_huge_page(void)
> +static struct page *dequeue_huge_page(struct vm_area_struct *vma,
> + unsigned long address)
> {
> int nid = numa_node_id();
> struct page *page = NULL;
> - struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;
> + struct zonelist *zonelist = huge_zonelist(vma, address);
> struct zone **z;
>
> for (z = zonelist->zones; *z; z++) {
> @@ -83,13 +84,13 @@ void free_huge_page(struct page *page)
> spin_unlock(&hugetlb_lock);
> }
>
> -struct page *alloc_huge_page(void)
> +struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr)
> {
> struct page *page;
> int i;
>
> spin_lock(&hugetlb_lock);
> - page = dequeue_huge_page();
> + page = dequeue_huge_page(vma, addr);
> if (!page) {
> spin_unlock(&hugetlb_lock);
> return NULL;
> @@ -192,7 +193,7 @@ static unsigned long set_max_huge_pages(
> spin_lock(&hugetlb_lock);
> try_to_free_low(count);
> while (count < nr_huge_pages) {
> - struct page *page = dequeue_huge_page();
> + struct page *page = dequeue_huge_page(NULL, 0);
> if (!page)
> break;
> update_and_free_page(page);
> @@ -361,8 +362,8 @@ void unmap_hugepage_range(struct vm_area
> flush_tlb_range(vma, start, end);
> }
>
> -static struct page *find_or_alloc_huge_page(struct address_space *mapping,
> - unsigned long idx, int shared)
> +static struct page *find_or_alloc_huge_page(struct vm_area_struct *vma, unsigned long addr,
> + struct address_space *mapping, unsigned long idx)
> {
> struct page *page;
> int err;
> @@ -374,13 +375,13 @@ retry:
>
> if (hugetlb_get_quota(mapping))
> goto out;
> - page = alloc_huge_page();
> + page = alloc_huge_page(vma, addr);
> if (!page) {
> hugetlb_put_quota(mapping);
> goto out;
> }
>
> - if (shared) {
> + if (vma->vm_flags & VM_SHARED) {
> err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
> if (err) {
> put_page(page);
> @@ -414,7 +415,7 @@ static int hugetlb_cow(struct mm_struct
> }
>
> page_cache_get(old_page);
> - new_page = alloc_huge_page();
> + new_page = alloc_huge_page(vma, address);
>
> if (!new_page) {
> page_cache_release(old_page);
> @@ -463,8 +464,7 @@ int hugetlb_no_page(struct mm_struct *mm
> * Use page lock to guard against racing truncation
> * before we get page_table_lock.
> */
> - page = find_or_alloc_huge_page(mapping, idx,
> - vma->vm_flags & VM_SHARED);
> + page = find_or_alloc_huge_page(vma, address, mapping, idx);
> if (!page)
> goto out;
>
> Index: linux-2.6.14-mm2/include/linux/mempolicy.h
> ===================================================================
> --- linux-2.6.14-mm2.orig/include/linux/mempolicy.h 2005-11-11 12:08:24.000000000 -0800
> +++ linux-2.6.14-mm2/include/linux/mempolicy.h 2005-11-11 12:11:01.000000000 -0800
> @@ -159,6 +159,8 @@ extern void numa_policy_init(void);
> extern void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new);
> extern struct mempolicy default_policy;
> extern unsigned next_slab_node(struct mempolicy *policy);
> +extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> + unsigned long addr);
>
> int do_migrate_pages(struct mm_struct *mm,
> const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
> Index: linux-2.6.14-mm2/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.14-mm2.orig/include/linux/hugetlb.h 2005-11-11 12:04:14.000000000 -0800
> +++ linux-2.6.14-mm2/include/linux/hugetlb.h 2005-11-11 12:11:01.000000000 -0800
> @@ -22,7 +22,7 @@ int hugetlb_report_meminfo(char *);
> int hugetlb_report_node_meminfo(int, char *);
> int is_hugepage_mem_enough(size_t);
> unsigned long hugetlb_total_pages(void);
> -struct page *alloc_huge_page(void);
> +struct page *alloc_huge_page(struct vm_area_struct *, unsigned long);
> void free_huge_page(struct page *);
> int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, int write_access);
> @@ -97,7 +97,7 @@ static inline unsigned long hugetlb_tota
> #define is_hugepage_only_range(mm, addr, len) 0
> #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) \
> do { } while (0)
> -#define alloc_huge_page() ({ NULL; })
> +#define alloc_huge_page(vma, addr) ({ NULL; })
> #define free_huge_page(p) ({ (void)(p); BUG(); })
> #define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; })
>
>
>
>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2005-11-14 18:10:16

by Christoph Lameter

Subject: Re: [RFC] NUMA memory policy support for HUGE pages

On Mon, 14 Nov 2005, Adam Litke wrote:

> On Fri, 2005-11-11 at 12:28 -0800, Christoph Lameter wrote:
> > I just saw that mm2 is out. This is the same patch against mm2 with
> > hugetlb COW support.
>
> This all seems reasonable to me. Were you planning to send out a
> separate patch to support MPOL_BIND?

MPOL_BIND will provide a zonelist containing only the allowed nodes. This
is already handled by the way the policy layer builds its zonelists.
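
For reference, a simplified sketch of how zonelist_policy() handles this
(reconstructed from memory, not a verbatim quote of mm/mempolicy.c):

	static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
	{
		int nd;

		switch (policy->policy) {
		case MPOL_PREFERRED:
			nd = policy->v.preferred_node;
			if (nd < 0)
				nd = numa_node_id();
			break;
		case MPOL_BIND:
			/* The bind policy carries its own zonelist, built at
			   mbind()/set_mempolicy() time from only the allowed
			   nodes, so returning it restricts the allocation. */
			if (gfp_zone(gfp) >= policy_zone)
				return policy->v.zonelist;
			/* fall through: lower zones get no policy applied */
		case MPOL_INTERLEAVE:	/* handled by the interleave paths */
		case MPOL_DEFAULT:
			nd = numa_node_id();
			break;
		default:
			nd = 0;
			BUG();
		}
		return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
	}

So a huge_zonelist() caller running under an MPOL_BIND policy automatically
gets a zonelist limited to the bound nodes.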

2005-11-14 21:48:04

by Christoph Lameter

Subject: Re: [RFC] NUMA memory policy support for HUGE pages

This is V2 of the patch.

Changes:

- Cleaned up by folding find_or_alloc() into hugetlb_no_page().

- Consolidated common code in the memory policy layer by creating a new
function, interleave_nid().

The patch applies on top of the allocation patch and the cpuset patch that
Andrew already accepted.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.14-mm2/mm/mempolicy.c
===================================================================
--- linux-2.6.14-mm2.orig/mm/mempolicy.c 2005-11-14 12:51:23.000000000 -0800
+++ linux-2.6.14-mm2/mm/mempolicy.c 2005-11-14 13:16:51.000000000 -0800
@@ -1181,6 +1181,34 @@ static unsigned offset_il_node(struct me
return nid;
}

+/* Calculate a node number for interleave */
+static inline unsigned interleave_nid(struct mempolicy *pol,
+ struct vm_area_struct *vma, unsigned long addr, int shift)
+{
+ if (vma) {
+ unsigned long off;
+
+ off = vma->vm_pgoff;
+ off += (addr - vma->vm_start) >> shift;
+ return offset_il_node(pol, vma, off);
+ } else
+ return interleave_nodes(pol);
+}
+
+/* Return a zonelist suitable for a huge page allocation. */
+struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol = get_vma_policy(current, vma, addr);
+
+ if (pol->policy == MPOL_INTERLEAVE) {
+ unsigned nid;
+
+ nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
+ return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
+ }
+ return zonelist_policy(GFP_HIGHUSER, pol);
+}
+
/* Allocate a page in interleaved policy.
Own path because it needs to do special accounting. */
static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
@@ -1229,15 +1257,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area

if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
unsigned nid;
- if (vma) {
- unsigned long off;
- off = vma->vm_pgoff;
- off += (addr - vma->vm_start) >> PAGE_SHIFT;
- nid = offset_il_node(pol, vma, off);
- } else {
- /* fall back to process interleaving */
- nid = interleave_nodes(pol);
- }
+
+ nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
return alloc_page_interleave(gfp, 0, nid);
}
return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
Index: linux-2.6.14-mm2/mm/hugetlb.c
===================================================================
--- linux-2.6.14-mm2.orig/mm/hugetlb.c 2005-11-14 12:51:23.000000000 -0800
+++ linux-2.6.14-mm2/mm/hugetlb.c 2005-11-14 13:37:16.000000000 -0800
@@ -33,11 +33,12 @@ static void enqueue_huge_page(struct pag
free_huge_pages_node[nid]++;
}

-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct vm_area_struct *vma,
+ unsigned long address)
{
int nid = numa_node_id();
struct page *page = NULL;
- struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;
+ struct zonelist *zonelist = huge_zonelist(vma, address);
struct zone **z;

for (z = zonelist->zones; *z; z++) {
@@ -83,13 +84,13 @@ void free_huge_page(struct page *page)
spin_unlock(&hugetlb_lock);
}

-struct page *alloc_huge_page(void)
+struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr)
{
struct page *page;
int i;

spin_lock(&hugetlb_lock);
- page = dequeue_huge_page();
+ page = dequeue_huge_page(vma, addr);
if (!page) {
spin_unlock(&hugetlb_lock);
return NULL;
@@ -192,7 +193,7 @@ static unsigned long set_max_huge_pages(
spin_lock(&hugetlb_lock);
try_to_free_low(count);
while (count < nr_huge_pages) {
- struct page *page = dequeue_huge_page();
+ struct page *page = dequeue_huge_page(NULL, 0);
if (!page)
break;
update_and_free_page(page);
@@ -361,42 +362,6 @@ void unmap_hugepage_range(struct vm_area
flush_tlb_range(vma, start, end);
}

-static struct page *find_or_alloc_huge_page(struct address_space *mapping,
- unsigned long idx, int shared)
-{
- struct page *page;
- int err;
-
-retry:
- page = find_lock_page(mapping, idx);
- if (page)
- goto out;
-
- if (hugetlb_get_quota(mapping))
- goto out;
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- goto out;
- }
-
- if (shared) {
- err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
- if (err) {
- put_page(page);
- hugetlb_put_quota(mapping);
- if (err == -EEXIST)
- goto retry;
- page = NULL;
- }
- } else {
- /* Caller expects a locked page */
- lock_page(page);
- }
-out:
- return page;
-}
-
static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pte_t pte)
{
@@ -414,7 +379,7 @@ static int hugetlb_cow(struct mm_struct
}

page_cache_get(old_page);
- new_page = alloc_huge_page();
+ new_page = alloc_huge_page(vma, address);

if (!new_page) {
page_cache_release(old_page);
@@ -463,10 +428,32 @@ int hugetlb_no_page(struct mm_struct *mm
* Use page lock to guard against racing truncation
* before we get page_table_lock.
*/
- page = find_or_alloc_huge_page(mapping, idx,
- vma->vm_flags & VM_SHARED);
- if (!page)
- goto out;
+retry:
+ page = find_lock_page(mapping, idx);
+ if (!page) {
+ if (hugetlb_get_quota(mapping))
+ goto out;
+
+ page = alloc_huge_page(vma, address);
+ if (!page) {
+ hugetlb_put_quota(mapping);
+ goto out;
+ }
+
+ if (vma->vm_flags & VM_SHARED) {
+ int err;
+
+ err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+ if (err) {
+ put_page(page);
+ hugetlb_put_quota(mapping);
+ if (err == -EEXIST)
+ goto retry;
+ goto out;
+ }
+ }
+ lock_page(page);
+ }

BUG_ON(!PageLocked(page));

Index: linux-2.6.14-mm2/include/linux/mempolicy.h
===================================================================
--- linux-2.6.14-mm2.orig/include/linux/mempolicy.h 2005-11-14 12:51:22.000000000 -0800
+++ linux-2.6.14-mm2/include/linux/mempolicy.h 2005-11-14 12:51:23.000000000 -0800
@@ -159,6 +159,8 @@ extern void numa_policy_init(void);
extern void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new);
extern struct mempolicy default_policy;
extern unsigned next_slab_node(struct mempolicy *policy);
+extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
+ unsigned long addr);

int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
Index: linux-2.6.14-mm2/include/linux/hugetlb.h
===================================================================
--- linux-2.6.14-mm2.orig/include/linux/hugetlb.h 2005-11-14 12:51:17.000000000 -0800
+++ linux-2.6.14-mm2/include/linux/hugetlb.h 2005-11-14 12:51:23.000000000 -0800
@@ -22,7 +22,7 @@ int hugetlb_report_meminfo(char *);
int hugetlb_report_node_meminfo(int, char *);
int is_hugepage_mem_enough(size_t);
unsigned long hugetlb_total_pages(void);
-struct page *alloc_huge_page(void);
+struct page *alloc_huge_page(struct vm_area_struct *, unsigned long);
void free_huge_page(struct page *);
int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access);
@@ -97,7 +97,7 @@ static inline unsigned long hugetlb_tota
#define is_hugepage_only_range(mm, addr, len) 0
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) \
do { } while (0)
-#define alloc_huge_page() ({ NULL; })
+#define alloc_huge_page(vma, addr) ({ NULL; })
#define free_huge_page(p) ({ (void)(p); BUG(); })
#define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; })

2005-11-14 22:31:16

by Adam Litke

Subject: Re: [RFC] NUMA memory policy support for HUGE pages

On Mon, 2005-11-14 at 13:46 -0800, Christoph Lameter wrote:
> This is V2 of the patch.
>
> Changes:
>
> - Cleaned up by folding find_or_alloc() into hugetlb_no_page().

IMHO this is not really a cleanup. When the demand fault patch stack
was first accepted, we decided to separate out find_or_alloc_huge_page()
because it has the page_cache retry loop with several exit conditions.
no_page() has its own backout logic and mixing the two makes for a
tangled mess. Can we leave that hunk out please?

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2005-11-14 23:25:32

by Christoph Lameter

Subject: Re: [RFC] NUMA memory policy support for HUGE pages

On Mon, 14 Nov 2005, Adam Litke wrote:

> On Mon, 2005-11-14 at 13:46 -0800, Christoph Lameter wrote:
> > This is V2 of the patch.
> >
> > Changes:
> >
> > - Cleaned up by folding find_or_alloc() into hugetlb_no_page().
>
> IMHO this is not really a cleanup. When the demand fault patch stack
> was first accepted, we decided to separate out find_or_alloc_huge_page()
> because it has the page_cache retry loop with several exit conditions.
> no_page() has its own backout logic and mixing the two makes for a
> tangled mess. Can we leave that hunk out please?

It seemed to me that find_or_alloc_huge_page() has pretty simple backout
logic that folds nicely into no_page(). Both functions share a lot of
variables, and putting them together not only increases the readability of
the code but also makes the function smaller and execution more efficient.

2005-11-15 12:26:31

by William Lee Irwin III

Subject: Re: [RFC] NUMA memory policy support for HUGE pages

On Mon, 14 Nov 2005, Adam Litke wrote:
>> IMHO this is not really a cleanup. When the demand fault patch stack
>> was first accepted, we decided to separate out find_or_alloc_huge_page()
>> because it has the page_cache retry loop with several exit conditions.
>> no_page() has its own backout logic and mixing the two makes for a
>> tangled mess. Can we leave that hunk out please?

On Mon, Nov 14, 2005 at 03:25:00PM -0800, Christoph Lameter wrote:
> It seemed to me that find_or_alloc_huge_page() has pretty simple backout
> logic that folds nicely into no_page(). Both functions share a lot of
> variables, and putting them together not only increases the readability of
> the code but also makes the function smaller and execution more efficient.

Looks like this is on the road to inclusion and so on. I'm not picky
about either approach wrt. nopage/etc. and find_or_alloc_huge_page()
affairs. Just get a consensus together and send it in.

Thanks.


-- wli