2008-03-17 01:58:31

by Andi Kleen

Subject: [PATCH] [0/18] GB pages hugetlb support


This patchkit supports GB pages for hugetlb on x86-64 in addition to
2MB pages. It is the successor of an earlier, much simpler
patchkit that allowed setting the hugepagesz globally at boot
to 1GB pages. The advantage of this more complex patchkit
is that it allows 2MB page users and 1GB page users to
coexist (although not on the same hugetlbfs mount points).

It first adds some straightforward infrastructure
to hugetlbfs to support multiple page sizes. Then it uses that
infrastructure to implement support for huge pages > MAX_ORDER
(which can be allocated at boot with bootmem only). Finally
the x86-64 port is extended to support 1GB pages on CPUs
that support them (AMD Quad Cores).

There is no support for i386 because GB pages are only available in
long mode.

The variable page size support is currently limited to the
specific use case of the single additional 1GB page size.
Using it for more page sizes (especially those < MAX_ORDER)
would require some more work, although the basic infrastructure
is all in place and the incremental work would be small.
I didn't bother to implement some corner cases that are not needed
for the GB page case, but I usually added comments, so they
should be easy to find (and fix) later :)

I also hacked in cpuset support. It would be good if
Paul double-checked that.

GB pages are only intended to be used in special situations, like
dedicated databases where complicated configuration does not matter.
That is why they have some limitations:
- Can only be allocated at boot (using hugepagesz=1G hugepages=...)
- Can't be freed at runtime
- One hugetlbfs mount per page size (using the pagesize=... mount
option). This is a little awkward, but greatly simplified the
code.
- No IPC SHM support currently (would not be very hard to do,
but it is unclear what the best API for this is. Suggestions
welcome)

Some of these limitations could be fixed later.
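
For illustration only (not part of the patchkit): with the kernel booted with
hugepagesz=1G hugepages=4, a dedicated mount using the pagesize= option gives
access to the 1GB pool. The mount point and file name below are made up, and
the program needs root for mount(2); a minimal sketch:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
	void *p;
	int fd;

	/* one hugetlbfs instance per page size, selected with pagesize= */
	if (mount("none", "/mnt/huge1g", "hugetlbfs", 0, "pagesize=1G"))
		return 1;

	fd = open("/mnt/huge1g/dbbuf", O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return 1;

	/* the mapping is backed by 1GB pages from the boot-time pool */
	p = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* ... use p ... */
	munmap(p, 1UL << 30);
	close(fd);
	return 0;
}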

Known issues:
- GB pages are not reported in total memory, which gives
confusing free(1) output
- I still have to convince myself how (and whether) free_pgd_pages works
on hugetlb, both with 1GB and with 2MB pages.
- cpuset support is a little dubious, but the code was
quite strange even before.
- lockdep sometimes complains about recursive page_table_locks
for shared hugetlb memory, but as far as I can see I didn't
actually change this area. Looks a little dubious, might
be a false positive too.
- hugemmap04 from LTP fails. The cause is currently unknown.

-Andi


2008-03-17 01:58:45

by Andi Kleen

Subject: [PATCH] [1/18] Convert hugetlb.c over to pass global state around in a structure


Large, but rather mechanical patch that converts most of the hugetlb.c
globals into structure members and passes them around.

Right now there is only a single global hstate structure, but
most of the infrastructure to extend it is there.
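
For illustration (not part of the patch itself): the conversion replaces the
compile-time HPAGE_* constants with per-hstate accessors. A hypothetical
helper following the new pattern would look like this:

/* hypothetical example of the accessor pattern introduced below */
static unsigned long hugepage_index(struct vm_area_struct *vma,
				    unsigned long address)
{
	struct hstate *h = hstate_vma(vma);	/* currently always &global_hstate */

	/* old code: (address - vma->vm_start) >> HPAGE_SHIFT */
	return (address - vma->vm_start) >> huge_page_shift(h);
}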

Signed-off-by: Andi Kleen <[email protected]>

---
arch/ia64/mm/hugetlbpage.c | 2
arch/powerpc/mm/hugetlbpage.c | 2
arch/sh/mm/hugetlbpage.c | 2
arch/sparc64/mm/hugetlbpage.c | 2
arch/x86/mm/hugetlbpage.c | 2
fs/hugetlbfs/inode.c | 45 +++---
include/linux/hugetlb.h | 70 +++++++++
ipc/shm.c | 3
mm/hugetlb.c | 295 ++++++++++++++++++++++--------------------
mm/memory.c | 2
mm/mempolicy.c | 10 -
mm/mmap.c | 3
12 files changed, 269 insertions(+), 169 deletions(-)

Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -22,30 +22,24 @@
#include "internal.h"

const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
-static unsigned long surplus_huge_pages;
-static unsigned long nr_overcommit_huge_pages;
unsigned long max_huge_pages;
unsigned long sysctl_overcommit_huge_pages;
-static struct list_head hugepage_freelists[MAX_NUMNODES];
-static unsigned int nr_huge_pages_node[MAX_NUMNODES];
-static unsigned int free_huge_pages_node[MAX_NUMNODES];
-static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-static int hugetlb_next_nid;
+
+struct hstate global_hstate;

/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
*/
static DEFINE_SPINLOCK(hugetlb_lock);

-static void clear_huge_page(struct page *page, unsigned long addr)
+static void clear_huge_page(struct page *page, unsigned long addr, unsigned sz)
{
int i;

might_sleep();
- for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); i++) {
+ for (i = 0; i < sz/PAGE_SIZE; i++) {
cond_resched();
clear_user_highpage(page + i, addr + i * PAGE_SIZE);
}
@@ -55,34 +49,35 @@ static void copy_huge_page(struct page *
unsigned long addr, struct vm_area_struct *vma)
{
int i;
+ struct hstate *h = hstate_vma(vma);

might_sleep();
- for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
+ for (i = 0; i < 1 << huge_page_order(h); i++) {
cond_resched();
copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
}
}

-static void enqueue_huge_page(struct page *page)
+static void enqueue_huge_page(struct hstate *h, struct page *page)
{
int nid = page_to_nid(page);
- list_add(&page->lru, &hugepage_freelists[nid]);
- free_huge_pages++;
- free_huge_pages_node[nid]++;
+ list_add(&page->lru, &h->hugepage_freelists[nid]);
+ h->free_huge_pages++;
+ h->free_huge_pages_node[nid]++;
}

-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct hstate *h)
{
int nid;
struct page *page = NULL;

for (nid = 0; nid < MAX_NUMNODES; ++nid) {
- if (!list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ if (!list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
break;
}
}
@@ -98,18 +93,19 @@ static struct page *dequeue_huge_page_vm
struct zonelist *zonelist = huge_zonelist(vma, address,
htlb_alloc_mask, &mpol);
struct zone **z;
+ struct hstate *h = hstate_vma(vma);

for (z = zonelist->zones; *z; z++) {
nid = zone_to_nid(*z);
if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
- !list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ !list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
if (vma && vma->vm_flags & VM_MAYSHARE)
- resv_huge_pages--;
+ h->resv_huge_pages--;
break;
}
}
@@ -117,23 +113,24 @@ static struct page *dequeue_huge_page_vm
return page;
}

-static void update_and_free_page(struct page *page)
+static void update_and_free_page(struct hstate *h, struct page *page)
{
int i;
- nr_huge_pages--;
- nr_huge_pages_node[page_to_nid(page)]--;
- for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
+ h->nr_huge_pages--;
+ h->nr_huge_pages_node[page_to_nid(page)]--;
+ for (i = 0; i < (1 << huge_page_order(h)); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
1 << PG_private | 1<< PG_writeback);
}
set_compound_page_dtor(page, NULL);
set_page_refcounted(page);
- __free_pages(page, HUGETLB_PAGE_ORDER);
+ __free_pages(page, huge_page_order(h));
}

static void free_huge_page(struct page *page)
{
+ struct hstate *h = &global_hstate;
int nid = page_to_nid(page);
struct address_space *mapping;

@@ -143,12 +140,12 @@ static void free_huge_page(struct page *
INIT_LIST_HEAD(&page->lru);

spin_lock(&hugetlb_lock);
- if (surplus_huge_pages_node[nid]) {
- update_and_free_page(page);
- surplus_huge_pages--;
- surplus_huge_pages_node[nid]--;
+ if (h->surplus_huge_pages_node[nid]) {
+ update_and_free_page(h, page);
+ h->surplus_huge_pages--;
+ h->surplus_huge_pages_node[nid]--;
} else {
- enqueue_huge_page(page);
+ enqueue_huge_page(h, page);
}
spin_unlock(&hugetlb_lock);
if (mapping)
@@ -160,7 +157,7 @@ static void free_huge_page(struct page *
* balanced by operating on them in a round-robin fashion.
* Returns 1 if an adjustment was made.
*/
-static int adjust_pool_surplus(int delta)
+static int adjust_pool_surplus(struct hstate *h, int delta)
{
static int prev_nid;
int nid = prev_nid;
@@ -173,15 +170,15 @@ static int adjust_pool_surplus(int delta
nid = first_node(node_online_map);

/* To shrink on this node, there must be a surplus page */
- if (delta < 0 && !surplus_huge_pages_node[nid])
+ if (delta < 0 && !h->surplus_huge_pages_node[nid])
continue;
/* Surplus cannot exceed the total number of pages */
- if (delta > 0 && surplus_huge_pages_node[nid] >=
- nr_huge_pages_node[nid])
+ if (delta > 0 && h->surplus_huge_pages_node[nid] >=
+ h->nr_huge_pages_node[nid])
continue;

- surplus_huge_pages += delta;
- surplus_huge_pages_node[nid] += delta;
+ h->surplus_huge_pages += delta;
+ h->surplus_huge_pages_node[nid] += delta;
ret = 1;
break;
} while (nid != prev_nid);
@@ -190,18 +187,18 @@ static int adjust_pool_surplus(int delta
return ret;
}

-static struct page *alloc_fresh_huge_page_node(int nid)
+static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;

page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ huge_page_order(h));
if (page) {
set_compound_page_dtor(page, free_huge_page);
spin_lock(&hugetlb_lock);
- nr_huge_pages++;
- nr_huge_pages_node[nid]++;
+ h->nr_huge_pages++;
+ h->nr_huge_pages_node[nid]++;
spin_unlock(&hugetlb_lock);
put_page(page); /* free it into the hugepage allocator */
}
@@ -209,17 +206,17 @@ static struct page *alloc_fresh_huge_pag
return page;
}

-static int alloc_fresh_huge_page(void)
+static int alloc_fresh_huge_page(struct hstate *h)
{
struct page *page;
int start_nid;
int next_nid;
int ret = 0;

- start_nid = hugetlb_next_nid;
+ start_nid = h->hugetlb_next_nid;

do {
- page = alloc_fresh_huge_page_node(hugetlb_next_nid);
+ page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
if (page)
ret = 1;
/*
@@ -233,17 +230,18 @@ static int alloc_fresh_huge_page(void)
* if we just successfully allocated a hugepage so that
* the next caller gets hugepages on the next node.
*/
- next_nid = next_node(hugetlb_next_nid, node_online_map);
+ next_nid = next_node(h->hugetlb_next_nid, node_online_map);
if (next_nid == MAX_NUMNODES)
next_nid = first_node(node_online_map);
- hugetlb_next_nid = next_nid;
- } while (!page && hugetlb_next_nid != start_nid);
+ h->hugetlb_next_nid = next_nid;
+ } while (!page && h->hugetlb_next_nid != start_nid);

return ret;
}

-static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
- unsigned long address)
+static struct page *alloc_buddy_huge_page(struct hstate *h,
+ struct vm_area_struct *vma,
+ unsigned long address)
{
struct page *page;
unsigned int nid;
@@ -272,17 +270,17 @@ static struct page *alloc_buddy_huge_pag
* per-node value is checked there.
*/
spin_lock(&hugetlb_lock);
- if (surplus_huge_pages >= nr_overcommit_huge_pages) {
+ if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
spin_unlock(&hugetlb_lock);
return NULL;
} else {
- nr_huge_pages++;
- surplus_huge_pages++;
+ h->nr_huge_pages++;
+ h->surplus_huge_pages++;
}
spin_unlock(&hugetlb_lock);

page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ huge_page_order(h));

spin_lock(&hugetlb_lock);
if (page) {
@@ -291,11 +289,11 @@ static struct page *alloc_buddy_huge_pag
/*
* We incremented the global counters already
*/
- nr_huge_pages_node[nid]++;
- surplus_huge_pages_node[nid]++;
+ h->nr_huge_pages_node[nid]++;
+ h->surplus_huge_pages_node[nid]++;
} else {
- nr_huge_pages--;
- surplus_huge_pages--;
+ h->nr_huge_pages--;
+ h->surplus_huge_pages--;
}
spin_unlock(&hugetlb_lock);

@@ -306,16 +304,16 @@ static struct page *alloc_buddy_huge_pag
* Increase the hugetlb pool such that it can accomodate a reservation
* of size 'delta'.
*/
-static int gather_surplus_pages(int delta)
+static int gather_surplus_pages(struct hstate *h, int delta)
{
struct list_head surplus_list;
struct page *page, *tmp;
int ret, i;
int needed, allocated;

- needed = (resv_huge_pages + delta) - free_huge_pages;
+ needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
if (needed <= 0) {
- resv_huge_pages += delta;
+ h->resv_huge_pages += delta;
return 0;
}

@@ -326,7 +324,7 @@ static int gather_surplus_pages(int delt
retry:
spin_unlock(&hugetlb_lock);
for (i = 0; i < needed; i++) {
- page = alloc_buddy_huge_page(NULL, 0);
+ page = alloc_buddy_huge_page(h, NULL, 0);
if (!page) {
/*
* We were not able to allocate enough pages to
@@ -347,7 +345,8 @@ retry:
* because either resv_huge_pages or free_huge_pages may have changed.
*/
spin_lock(&hugetlb_lock);
- needed = (resv_huge_pages + delta) - (free_huge_pages + allocated);
+ needed = (h->resv_huge_pages + delta) -
+ (h->free_huge_pages + allocated);
if (needed > 0)
goto retry;

@@ -360,13 +359,13 @@ retry:
* before they are reserved.
*/
needed += allocated;
- resv_huge_pages += delta;
+ h->resv_huge_pages += delta;
ret = 0;
free:
list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
list_del(&page->lru);
if ((--needed) >= 0)
- enqueue_huge_page(page);
+ enqueue_huge_page(h, page);
else {
/*
* Decrement the refcount and free the page using its
@@ -388,34 +387,35 @@ free:
* allocated to satisfy the reservation must be explicitly freed if they were
* never used.
*/
-static void return_unused_surplus_pages(unsigned long unused_resv_pages)
+static void
+return_unused_surplus_pages(struct hstate *h, unsigned long unused_resv_pages)
{
static int nid = -1;
struct page *page;
unsigned long nr_pages;

/* Uncommit the reservation */
- resv_huge_pages -= unused_resv_pages;
+ h->resv_huge_pages -= unused_resv_pages;

- nr_pages = min(unused_resv_pages, surplus_huge_pages);
+ nr_pages = min(unused_resv_pages, h->surplus_huge_pages);

while (nr_pages) {
nid = next_node(nid, node_online_map);
if (nid == MAX_NUMNODES)
nid = first_node(node_online_map);

- if (!surplus_huge_pages_node[nid])
+ if (!h->surplus_huge_pages_node[nid])
continue;

- if (!list_empty(&hugepage_freelists[nid])) {
- page = list_entry(hugepage_freelists[nid].next,
+ if (!list_empty(&h->hugepage_freelists[nid])) {
+ page = list_entry(h->hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
- update_and_free_page(page);
- free_huge_pages--;
- free_huge_pages_node[nid]--;
- surplus_huge_pages--;
- surplus_huge_pages_node[nid]--;
+ update_and_free_page(h, page);
+ h->free_huge_pages--;
+ h->free_huge_pages_node[nid]--;
+ h->surplus_huge_pages--;
+ h->surplus_huge_pages_node[nid]--;
nr_pages--;
}
}
@@ -437,16 +437,17 @@ static struct page *alloc_huge_page_priv
unsigned long addr)
{
struct page *page = NULL;
+ struct hstate *h = hstate_vma(vma);

if (hugetlb_get_quota(vma->vm_file->f_mapping, 1))
return ERR_PTR(-VM_FAULT_SIGBUS);

spin_lock(&hugetlb_lock);
- if (free_huge_pages > resv_huge_pages)
+ if (h->free_huge_pages > h->resv_huge_pages)
page = dequeue_huge_page_vma(vma, addr);
spin_unlock(&hugetlb_lock);
if (!page) {
- page = alloc_buddy_huge_page(vma, addr);
+ page = alloc_buddy_huge_page(h, vma, addr);
if (!page) {
hugetlb_put_quota(vma->vm_file->f_mapping, 1);
return ERR_PTR(-VM_FAULT_OOM);
@@ -476,21 +477,27 @@ static struct page *alloc_huge_page(stru
static int __init hugetlb_init(void)
{
unsigned long i;
+ struct hstate *h = &global_hstate;

if (HPAGE_SHIFT == 0)
return 0;

+ if (!h->order) {
+ h->order = HPAGE_SHIFT - PAGE_SHIFT;
+ h->mask = HPAGE_MASK;
+ }
+
for (i = 0; i < MAX_NUMNODES; ++i)
- INIT_LIST_HEAD(&hugepage_freelists[i]);
+ INIT_LIST_HEAD(&h->hugepage_freelists[i]);

- hugetlb_next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = first_node(node_online_map);

for (i = 0; i < max_huge_pages; ++i) {
- if (!alloc_fresh_huge_page())
+ if (!alloc_fresh_huge_page(h))
break;
}
- max_huge_pages = free_huge_pages = nr_huge_pages = i;
- printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+ max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+ printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
return 0;
}
module_init(hugetlb_init);
@@ -518,19 +525,21 @@ static unsigned int cpuset_mems_nr(unsig
#ifdef CONFIG_HIGHMEM
static void try_to_free_low(unsigned long count)
{
+ struct hstate *h = &global_hstate;
int i;

for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
- list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
+ struct list_head *freel = &h->hugepage_freelists[i];
+ list_for_each_entry_safe(page, next, freel, lru) {
if (count >= nr_huge_pages)
return;
if (PageHighMem(page))
continue;
list_del(&page->lru);
update_and_free_page(page);
- free_huge_pages--;
- free_huge_pages_node[page_to_nid(page)]--;
+ h->free_huge_pages--;
+ h->free_huge_pages_node[page_to_nid(page)]--;
}
}
}
@@ -540,10 +549,11 @@ static inline void try_to_free_low(unsig
}
#endif

-#define persistent_huge_pages (nr_huge_pages - surplus_huge_pages)
+#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
static unsigned long set_max_huge_pages(unsigned long count)
{
unsigned long min_count, ret;
+ struct hstate *h = &global_hstate;

/*
* Increase the pool size
@@ -557,12 +567,12 @@ static unsigned long set_max_huge_pages(
* within all the constraints specified by the sysctls.
*/
spin_lock(&hugetlb_lock);
- while (surplus_huge_pages && count > persistent_huge_pages) {
- if (!adjust_pool_surplus(-1))
+ while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
+ if (!adjust_pool_surplus(h, -1))
break;
}

- while (count > persistent_huge_pages) {
+ while (count > persistent_huge_pages(h)) {
int ret;
/*
* If this allocation races such that we no longer need the
@@ -570,7 +580,7 @@ static unsigned long set_max_huge_pages(
* and reducing the surplus.
*/
spin_unlock(&hugetlb_lock);
- ret = alloc_fresh_huge_page();
+ ret = alloc_fresh_huge_page(h);
spin_lock(&hugetlb_lock);
if (!ret)
goto out;
@@ -592,21 +602,21 @@ static unsigned long set_max_huge_pages(
* and won't grow the pool anywhere else. Not until one of the
* sysctls are changed, or the surplus pages go out of use.
*/
- min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
+ min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
min_count = max(count, min_count);
try_to_free_low(min_count);
- while (min_count < persistent_huge_pages) {
- struct page *page = dequeue_huge_page();
+ while (min_count < persistent_huge_pages(h)) {
+ struct page *page = dequeue_huge_page(h);
if (!page)
break;
- update_and_free_page(page);
+ update_and_free_page(h, page);
}
- while (count < persistent_huge_pages) {
- if (!adjust_pool_surplus(1))
+ while (count < persistent_huge_pages(h)) {
+ if (!adjust_pool_surplus(h, 1))
break;
}
out:
- ret = persistent_huge_pages;
+ ret = persistent_huge_pages(h);
spin_unlock(&hugetlb_lock);
return ret;
}
@@ -636,9 +646,10 @@ int hugetlb_overcommit_handler(struct ct
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
+ struct hstate *h = &global_hstate;
proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
spin_lock(&hugetlb_lock);
- nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
+ h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
spin_unlock(&hugetlb_lock);
return 0;
}
@@ -647,32 +658,35 @@ int hugetlb_overcommit_handler(struct ct

int hugetlb_report_meminfo(char *buf)
{
+ struct hstate *h = &global_hstate;
return sprintf(buf,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
"HugePages_Rsvd: %5lu\n"
"HugePages_Surp: %5lu\n"
"Hugepagesize: %5lu kB\n",
- nr_huge_pages,
- free_huge_pages,
- resv_huge_pages,
- surplus_huge_pages,
- HPAGE_SIZE/1024);
+ h->nr_huge_pages,
+ h->free_huge_pages,
+ h->resv_huge_pages,
+ h->surplus_huge_pages,
+ 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
}

int hugetlb_report_node_meminfo(int nid, char *buf)
{
+ struct hstate *h = &global_hstate;
return sprintf(buf,
"Node %d HugePages_Total: %5u\n"
"Node %d HugePages_Free: %5u\n",
- nid, nr_huge_pages_node[nid],
- nid, free_huge_pages_node[nid]);
+ nid, h->nr_huge_pages_node[nid],
+ nid, h->free_huge_pages_node[nid]);
}

/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
- return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE);
+ struct hstate *h = &global_hstate;
+ return h->nr_huge_pages * (1 << huge_page_order(h));
}

/*
@@ -727,14 +741,16 @@ int copy_hugetlb_page_range(struct mm_st
struct page *ptepage;
unsigned long addr;
int cow;
+ struct hstate *h = hstate_vma(vma);
+ unsigned sz = huge_page_size(h);

cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
src_pte = huge_pte_offset(src, addr);
if (!src_pte)
continue;
- dst_pte = huge_pte_alloc(dst, addr);
+ dst_pte = huge_pte_alloc(dst, addr, sz);
if (!dst_pte)
goto nomem;

@@ -770,6 +786,9 @@ void __unmap_hugepage_range(struct vm_ar
pte_t pte;
struct page *page;
struct page *tmp;
+ struct hstate *h = hstate_vma(vma);
+ unsigned sz = huge_page_size(h);
+
/*
* A page gathering list, protected by per file i_mmap_lock. The
* lock is used to avoid list corruption from multiple unmapping
@@ -778,11 +797,11 @@ void __unmap_hugepage_range(struct vm_ar
LIST_HEAD(page_list);

WARN_ON(!is_vm_hugetlb_page(vma));
- BUG_ON(start & ~HPAGE_MASK);
- BUG_ON(end & ~HPAGE_MASK);
+ BUG_ON(start & ~huge_page_mask(h));
+ BUG_ON(end & ~huge_page_mask(h));

spin_lock(&mm->page_table_lock);
- for (address = start; address < end; address += HPAGE_SIZE) {
+ for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
@@ -830,6 +849,7 @@ static int hugetlb_cow(struct mm_struct
{
struct page *old_page, *new_page;
int avoidcopy;
+ struct hstate *h = hstate_vma(vma);

old_page = pte_page(pte);

@@ -854,7 +874,7 @@ static int hugetlb_cow(struct mm_struct
__SetPageUptodate(new_page);
spin_lock(&mm->page_table_lock);

- ptep = huge_pte_offset(mm, address & HPAGE_MASK);
+ ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(*ptep, pte))) {
/* Break COW */
set_huge_pte_at(mm, address, ptep,
@@ -876,10 +896,11 @@ static int hugetlb_no_page(struct mm_str
struct page *page;
struct address_space *mapping;
pte_t new_pte;
+ struct hstate *h = hstate_vma(vma);

mapping = vma->vm_file->f_mapping;
- idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+ idx = ((address - vma->vm_start) >> huge_page_shift(h))
+ + (vma->vm_pgoff >> huge_page_order(h));

/*
* Use page lock to guard against racing truncation
@@ -888,7 +909,7 @@ static int hugetlb_no_page(struct mm_str
retry:
page = find_lock_page(mapping, idx);
if (!page) {
- size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto out;
page = alloc_huge_page(vma, address);
@@ -896,7 +917,7 @@ retry:
ret = -PTR_ERR(page);
goto out;
}
- clear_huge_page(page, address);
+ clear_huge_page(page, address, huge_page_size(h));
__SetPageUptodate(page);

if (vma->vm_flags & VM_SHARED) {
@@ -912,14 +933,14 @@ retry:
}

spin_lock(&inode->i_lock);
- inode->i_blocks += BLOCKS_PER_HUGEPAGE;
+ inode->i_blocks += (huge_page_size(h)) / 512;
spin_unlock(&inode->i_lock);
} else
lock_page(page);
}

spin_lock(&mm->page_table_lock);
- size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto backout;

@@ -955,8 +976,9 @@ int hugetlb_fault(struct mm_struct *mm,
pte_t entry;
int ret;
static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+ struct hstate *h = hstate_vma(vma);

- ptep = huge_pte_alloc(mm, address);
+ ptep = huge_pte_alloc(mm, address, huge_page_size(h));
if (!ptep)
return VM_FAULT_OOM;

@@ -994,6 +1016,7 @@ int follow_hugetlb_page(struct mm_struct
unsigned long pfn_offset;
unsigned long vaddr = *position;
int remainder = *length;
+ struct hstate *h = hstate_vma(vma);

spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
@@ -1005,7 +1028,7 @@ int follow_hugetlb_page(struct mm_struct
* each hugepage. We have to make * sure we get the
* first, for the page indexing below to work.
*/
- pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);
+ pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));

if (!pte || pte_none(*pte) || (write && !pte_write(*pte))) {
int ret;
@@ -1022,7 +1045,7 @@ int follow_hugetlb_page(struct mm_struct
break;
}

- pfn_offset = (vaddr & ~HPAGE_MASK) >> PAGE_SHIFT;
+ pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
page = pte_page(*pte);
same_page:
if (pages) {
@@ -1038,7 +1061,7 @@ same_page:
--remainder;
++i;
if (vaddr < vma->vm_end && remainder &&
- pfn_offset < HPAGE_SIZE/PAGE_SIZE) {
+ pfn_offset < (1 << huge_page_order(h))) {
/*
* We use pfn_offset to avoid touching the pageframes
* of this compound page.
@@ -1060,13 +1083,14 @@ void hugetlb_change_protection(struct vm
unsigned long start = address;
pte_t *ptep;
pte_t pte;
+ struct hstate *h = hstate_vma(vma);

BUG_ON(address >= end);
flush_cache_range(vma, address, end);

spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
spin_lock(&mm->page_table_lock);
- for (; address < end; address += HPAGE_SIZE) {
+ for (; address < end; address += huge_page_size(h)) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
@@ -1205,7 +1229,7 @@ static long region_truncate(struct list_
return chg;
}

-static int hugetlb_acct_memory(long delta)
+static int hugetlb_acct_memory(struct hstate *h, long delta)
{
int ret = -ENOMEM;

@@ -1228,18 +1252,18 @@ static int hugetlb_acct_memory(long delt
* semantics that cpuset has.
*/
if (delta > 0) {
- if (gather_surplus_pages(delta) < 0)
+ if (gather_surplus_pages(h, delta) < 0)
goto out;

- if (delta > cpuset_mems_nr(free_huge_pages_node)) {
- return_unused_surplus_pages(delta);
+ if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
+ return_unused_surplus_pages(h, delta);
goto out;
}
}

ret = 0;
if (delta < 0)
- return_unused_surplus_pages((unsigned long) -delta);
+ return_unused_surplus_pages(h, (unsigned long) -delta);

out:
spin_unlock(&hugetlb_lock);
@@ -1249,6 +1273,7 @@ out:
int hugetlb_reserve_pages(struct inode *inode, long from, long to)
{
long ret, chg;
+ struct hstate *h = &global_hstate;

chg = region_chg(&inode->i_mapping->private_list, from, to);
if (chg < 0)
@@ -1256,7 +1281,7 @@ int hugetlb_reserve_pages(struct inode *

if (hugetlb_get_quota(inode->i_mapping, chg))
return -ENOSPC;
- ret = hugetlb_acct_memory(chg);
+ ret = hugetlb_acct_memory(h, chg);
if (ret < 0) {
hugetlb_put_quota(inode->i_mapping, chg);
return ret;
@@ -1267,12 +1292,13 @@ int hugetlb_reserve_pages(struct inode *

void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
+ struct hstate *h = &global_hstate;
long chg = region_truncate(&inode->i_mapping->private_list, offset);

spin_lock(&inode->i_lock);
- inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
+ inode->i_blocks -= ((huge_page_size(h))/512) * freed;
spin_unlock(&inode->i_lock);

hugetlb_put_quota(inode->i_mapping, (chg - freed));
- hugetlb_acct_memory(-(chg - freed));
+ hugetlb_acct_memory(h, -(chg - freed));
}
Index: linux/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux/arch/powerpc/mm/hugetlbpage.c
@@ -128,7 +128,7 @@ pte_t *huge_pte_offset(struct mm_struct
return NULL;
}

-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pg;
pud_t *pu;
Index: linux/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/sparc64/mm/hugetlbpage.c
+++ linux/arch/sparc64/mm/hugetlbpage.c
@@ -195,7 +195,7 @@ hugetlb_get_unmapped_area(struct file *f
pgoff, flags);
}

-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/sh/mm/hugetlbpage.c
+++ linux/arch/sh/mm/hugetlbpage.c
@@ -22,7 +22,7 @@
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>

-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/ia64/mm/hugetlbpage.c
+++ linux/arch/ia64/mm/hugetlbpage.c
@@ -24,7 +24,7 @@
unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;

pte_t *
-huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
+huge_pte_alloc (struct mm_struct *mm, unsigned long addr, int sz)
{
unsigned long taddr = htlbpage_to_page(addr);
pgd_t *pgd;
Index: linux/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/x86/mm/hugetlbpage.c
+++ linux/arch/x86/mm/hugetlbpage.c
@@ -124,7 +124,7 @@ int huge_pmd_unshare(struct mm_struct *m
return 1;
}

-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
{
pgd_t *pgd;
pud_t *pud;
Index: linux/include/linux/hugetlb.h
===================================================================
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -40,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;

/* arch callbacks */

-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz);
pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -95,7 +95,6 @@ pte_t huge_ptep_get_and_clear(struct mm_
#else
void hugetlb_prefault_arch_hook(struct mm_struct *mm);
#endif
-
#else /* !CONFIG_HUGETLB_PAGE */

static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -169,8 +168,6 @@ struct file *hugetlb_file_setup(const ch
int hugetlb_get_quota(struct address_space *mapping, long delta);
void hugetlb_put_quota(struct address_space *mapping, long delta);

-#define BLOCKS_PER_HUGEPAGE (HPAGE_SIZE / 512)
-
static inline int is_file_hugepages(struct file *file)
{
if (file->f_op == &hugetlbfs_file_operations)
@@ -199,4 +196,69 @@ unsigned long hugetlb_get_unmapped_area(
unsigned long flags);
#endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */

+#ifdef CONFIG_HUGETLB_PAGE
+
+/* Defines one hugetlb page size */
+struct hstate {
+ int hugetlb_next_nid;
+ short order;
+ /* 2 bytes free */
+ unsigned long mask;
+ unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
+ unsigned long surplus_huge_pages;
+ unsigned long nr_overcommit_huge_pages;
+ struct list_head hugepage_freelists[MAX_NUMNODES];
+ unsigned int nr_huge_pages_node[MAX_NUMNODES];
+ unsigned int free_huge_pages_node[MAX_NUMNODES];
+ unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+};
+
+extern struct hstate global_hstate;
+
+static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
+{
+ return &global_hstate;
+}
+
+static inline struct hstate *hstate_file(struct file *f)
+{
+ return &global_hstate;
+}
+
+static inline struct hstate *hstate_inode(struct inode *i)
+{
+ return &global_hstate;
+}
+
+static inline unsigned huge_page_size(struct hstate *h)
+{
+ return PAGE_SIZE << h->order;
+}
+
+static inline unsigned long huge_page_mask(struct hstate *h)
+{
+ return h->mask;
+}
+
+static inline unsigned long huge_page_order(struct hstate *h)
+{
+ return h->order;
+}
+
+static inline unsigned huge_page_shift(struct hstate *h)
+{
+ return h->order + PAGE_SHIFT;
+}
+
+#else
+struct hstate {};
+#define hstate_file(f) NULL
+#define hstate_vma(v) NULL
+#define hstate_inode(i) NULL
+#define huge_page_size(h) PAGE_SIZE
+#define huge_page_mask(h) PAGE_MASK
+#define huge_page_order(h) 0
+#define huge_page_shift(h) PAGE_SHIFT
+#endif
+
#endif /* _LINUX_HUGETLB_H */
Index: linux/fs/hugetlbfs/inode.c
===================================================================
--- linux.orig/fs/hugetlbfs/inode.c
+++ linux/fs/hugetlbfs/inode.c
@@ -80,6 +80,7 @@ static int hugetlbfs_file_mmap(struct fi
struct inode *inode = file->f_path.dentry->d_inode;
loff_t len, vma_len;
int ret;
+ struct hstate *h = hstate_file(file);

/*
* vma address alignment (but not the pgoff alignment) has
@@ -92,7 +93,7 @@ static int hugetlbfs_file_mmap(struct fi
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;

- if (vma->vm_pgoff & ~(HPAGE_MASK >> PAGE_SHIFT))
+ if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
return -EINVAL;

vma_len = (loff_t)(vma->vm_end - vma->vm_start);
@@ -104,8 +105,8 @@ static int hugetlbfs_file_mmap(struct fi
len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

if (vma->vm_flags & VM_MAYSHARE &&
- hugetlb_reserve_pages(inode, vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT),
- len >> HPAGE_SHIFT))
+ hugetlb_reserve_pages(inode, vma->vm_pgoff >> huge_page_order(h),
+ len >> huge_page_shift(h)))
goto out;

ret = 0;
@@ -130,8 +131,9 @@ hugetlb_get_unmapped_area(struct file *f
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long start_addr;
+ struct hstate *h = hstate_file(file);

- if (len & ~HPAGE_MASK)
+ if (len & ~huge_page_mask(h))
return -EINVAL;
if (len > TASK_SIZE)
return -ENOMEM;
@@ -143,7 +145,7 @@ hugetlb_get_unmapped_area(struct file *f
}

if (addr) {
- addr = ALIGN(addr, HPAGE_SIZE);
+ addr = ALIGN(addr, huge_page_size(h));
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr &&
(!vma || addr + len <= vma->vm_start))
@@ -156,7 +158,7 @@ hugetlb_get_unmapped_area(struct file *f
start_addr = TASK_UNMAPPED_BASE;

full_search:
- addr = ALIGN(start_addr, HPAGE_SIZE);
+ addr = ALIGN(start_addr, huge_page_size(h));

for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
/* At this point: (!vma || addr < vma->vm_end). */
@@ -174,7 +176,7 @@ full_search:

if (!vma || addr + len <= vma->vm_start)
return addr;
- addr = ALIGN(vma->vm_end, HPAGE_SIZE);
+ addr = ALIGN(vma->vm_end, huge_page_size(h));
}
}
#endif
@@ -225,10 +227,11 @@ hugetlbfs_read_actor(struct page *page,
static ssize_t hugetlbfs_read(struct file *filp, char __user *buf,
size_t len, loff_t *ppos)
{
+ struct hstate *h = hstate_file(filp);
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
- unsigned long index = *ppos >> HPAGE_SHIFT;
- unsigned long offset = *ppos & ~HPAGE_MASK;
+ unsigned long index = *ppos >> huge_page_shift(h);
+ unsigned long offset = *ppos & ~huge_page_mask(h);
unsigned long end_index;
loff_t isize;
ssize_t retval = 0;
@@ -243,17 +246,17 @@ static ssize_t hugetlbfs_read(struct fil
if (!isize)
goto out;

- end_index = (isize - 1) >> HPAGE_SHIFT;
+ end_index = (isize - 1) >> huge_page_shift(h);
for (;;) {
struct page *page;
int nr, ret;

/* nr is the maximum number of bytes to copy from this page */
- nr = HPAGE_SIZE;
+ nr = huge_page_size(h);
if (index >= end_index) {
if (index > end_index)
goto out;
- nr = ((isize - 1) & ~HPAGE_MASK) + 1;
+ nr = ((isize - 1) & ~huge_page_mask(h)) + 1;
if (nr <= offset) {
goto out;
}
@@ -287,8 +290,8 @@ static ssize_t hugetlbfs_read(struct fil
offset += ret;
retval += ret;
len -= ret;
- index += offset >> HPAGE_SHIFT;
- offset &= ~HPAGE_MASK;
+ index += offset >> huge_page_shift(h);
+ offset &= ~huge_page_mask(h);

if (page)
page_cache_release(page);
@@ -298,7 +301,7 @@ static ssize_t hugetlbfs_read(struct fil
break;
}
out:
- *ppos = ((loff_t)index << HPAGE_SHIFT) + offset;
+ *ppos = ((loff_t)index << huge_page_shift(h)) + offset;
mutex_unlock(&inode->i_mutex);
return retval;
}
@@ -339,8 +342,9 @@ static void truncate_huge_page(struct pa

static void truncate_hugepages(struct inode *inode, loff_t lstart)
{
+ struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
- const pgoff_t start = lstart >> HPAGE_SHIFT;
+ const pgoff_t start = lstart >> huge_page_shift(h);
struct pagevec pvec;
pgoff_t next;
int i, freed = 0;
@@ -449,8 +453,9 @@ static int hugetlb_vmtruncate(struct ino
{
pgoff_t pgoff;
struct address_space *mapping = inode->i_mapping;
+ struct hstate *h = hstate_inode(inode);

- BUG_ON(offset & ~HPAGE_MASK);
+ BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;

i_size_write(inode, offset);
@@ -465,6 +470,7 @@ static int hugetlb_vmtruncate(struct ino
static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = dentry->d_inode;
+ struct hstate *h = hstate_inode(inode);
int error;
unsigned int ia_valid = attr->ia_valid;

@@ -476,7 +482,7 @@ static int hugetlbfs_setattr(struct dent

if (ia_valid & ATTR_SIZE) {
error = -EINVAL;
- if (!(attr->ia_size & ~HPAGE_MASK))
+ if (!(attr->ia_size & ~huge_page_mask(h)))
error = hugetlb_vmtruncate(inode, attr->ia_size);
if (error)
goto out;
@@ -610,9 +616,10 @@ static int hugetlbfs_set_page_dirty(stru
static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
+ struct hstate *h = hstate_inode(dentry->d_inode);

buf->f_type = HUGETLBFS_MAGIC;
- buf->f_bsize = HPAGE_SIZE;
+ buf->f_bsize = huge_page_size(h);
if (sbinfo) {
spin_lock(&sbinfo->stat_lock);
/* If no limits set, just report 0 for max/free/used
Index: linux/ipc/shm.c
===================================================================
--- linux.orig/ipc/shm.c
+++ linux/ipc/shm.c
@@ -612,7 +612,8 @@ static void shm_get_stat(struct ipc_name

if (is_file_hugepages(shp->shm_file)) {
struct address_space *mapping = inode->i_mapping;
- *rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
+ struct hstate *h = hstate_file(shp->shm_file);
+ *rss += (1 << huge_page_order(h)) * mapping->nrpages;
} else {
struct shmem_inode_info *info = SHMEM_I(inode);
spin_lock(&info->lock);
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -848,7 +848,7 @@ unsigned long unmap_vmas(struct mmu_gath
if (unlikely(is_vm_hugetlb_page(vma))) {
unmap_hugepage_range(vma, start, end);
zap_work -= (end - start) /
- (HPAGE_SIZE / PAGE_SIZE);
+ (1 << huge_page_order(hstate_vma(vma)));
start = end;
} else
start = unmap_page_range(*tlbp, vma,
Index: linux/mm/mempolicy.c
===================================================================
--- linux.orig/mm/mempolicy.c
+++ linux/mm/mempolicy.c
@@ -1295,7 +1295,8 @@ struct zonelist *huge_zonelist(struct vm
if (pol->policy == MPOL_INTERLEAVE) {
unsigned nid;

- nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
+ nid = interleave_nid(pol, vma, addr,
+ huge_page_shift(hstate_vma(vma)));
__mpol_free(pol); /* finished with pol */
return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
}
@@ -1939,9 +1940,12 @@ static void check_huge_range(struct vm_a
{
unsigned long addr;
struct page *page;
+ struct hstate *h = hstate_vma(vma);
+ unsigned sz = huge_page_size(h);

- for (addr = start; addr < end; addr += HPAGE_SIZE) {
- pte_t *ptep = huge_pte_offset(vma->vm_mm, addr & HPAGE_MASK);
+ for (addr = start; addr < end; addr += sz) {
+ pte_t *ptep = huge_pte_offset(vma->vm_mm,
+ addr & huge_page_mask(h));
pte_t pte;

if (!ptep)
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c
+++ linux/mm/mmap.c
@@ -1793,7 +1793,8 @@ int split_vma(struct mm_struct * mm, str
struct mempolicy *pol;
struct vm_area_struct *new;

- if (is_vm_hugetlb_page(vma) && (addr & ~HPAGE_MASK))
+ if (is_vm_hugetlb_page(vma) && (addr &
+ ~(huge_page_mask(hstate_vma(vma)))))
return -EINVAL;

if (mm->map_count >= sysctl_max_map_count)

2008-03-17 01:59:22

by Andi Kleen

Subject: [PATCH] [2/18] Add basic support for more than one hstate in hugetlbfs


- Convert hstates to an array
- Add a first default entry covering the standard huge page size
- Add functions for architectures to register new hstates
- Add basic iterators over hstates (a short usage sketch follows below)
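
For illustration only (not taken from the patch): an architecture registers an
additional page size by its order while parsing a boot option, and generic
code can then walk all registered hstates. The 1GB order below (30 - PAGE_SHIFT
on x86-64) is an example, and the for_each_hstate() iterator is local to
mm/hugetlb.c in this patch.

	struct hstate *h;

	/* sketch: register a 1GB hstate from boot option parsing */
	huge_add_hstate(30 - PAGE_SHIFT);

	/* sketch: walk every registered hstate */
	for_each_hstate (h)
		printk(KERN_INFO "huge page order %lu: %lu pages free\n",
		       huge_page_order(h), h->free_huge_pages);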

Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/hugetlb.h | 10 +++++++++-
mm/hugetlb.c | 46 +++++++++++++++++++++++++++++++++++++---------
2 files changed, 46 insertions(+), 10 deletions(-)

Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -27,7 +27,15 @@ unsigned long sysctl_overcommit_huge_pag
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;

-struct hstate global_hstate;
+static int max_hstate = 1;
+
+struct hstate hstates[HUGE_MAX_HSTATE];
+
+/* for command line parsing */
+struct hstate *parsed_hstate __initdata = &global_hstate;
+
+#define for_each_hstate(h) \
+ for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)

/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -474,15 +482,11 @@ static struct page *alloc_huge_page(stru
return page;
}

-static int __init hugetlb_init(void)
+static int __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
- struct hstate *h = &global_hstate;

- if (HPAGE_SHIFT == 0)
- return 0;
-
- if (!h->order) {
+ if (h == &global_hstate && !h->order) {
h->order = HPAGE_SHIFT - PAGE_SHIFT;
h->mask = HPAGE_MASK;
}
@@ -497,11 +501,34 @@ static int __init hugetlb_init(void)
break;
}
max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
- printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
+
+ printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
+ h->free_huge_pages,
+ 1 << (h->order + PAGE_SHIFT - 20));
return 0;
}
+
+static int __init hugetlb_init(void)
+{
+ if (HPAGE_SHIFT == 0)
+ return 0;
+ return hugetlb_init_hstate(&global_hstate);
+}
module_init(hugetlb_init);

+/* Should be called on processing a hugepagesz=... option */
+void __init huge_add_hstate(unsigned order)
+{
+ struct hstate *h;
+ BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
+ BUG_ON(order <= HPAGE_SHIFT - PAGE_SHIFT);
+ h = &hstates[max_hstate++];
+ h->order = order;
+ h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
+ hugetlb_init_hstate(h);
+ parsed_hstate = h;
+}
+
static int __init hugetlb_setup(char *s)
{
if (sscanf(s, "%lu", &max_huge_pages) <= 0)
Index: linux/include/linux/hugetlb.h
===================================================================
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -213,7 +213,15 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
};

-extern struct hstate global_hstate;
+void __init huge_add_hstate(unsigned order);
+
+#ifndef HUGE_MAX_HSTATE
+#define HUGE_MAX_HSTATE 1
+#endif
+
+extern struct hstate hstates[HUGE_MAX_HSTATE];
+
+#define global_hstate (hstates[0])

static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{

2008-03-17 01:59:35

by Andi Kleen

Subject: [PATCH] [5/18] Expand the hugetlbfs sysctls to handle arrays for all hstates


- I didn't bother with hugetlb_shm_group and treat_as_movable;
these are still single globals.
- Also improve error propagation for the sysctl handlers a bit


Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/hugetlb.h | 5 +++--
kernel/sysctl.c | 2 +-
mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++------------
3 files changed, 35 insertions(+), 15 deletions(-)

Index: linux/include/linux/hugetlb.h
===================================================================
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -32,8 +32,6 @@ int hugetlb_fault(struct mm_struct *mm,
int hugetlb_reserve_pages(struct inode *inode, long from, long to);
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);

-extern unsigned long max_huge_pages;
-extern unsigned long sysctl_overcommit_huge_pages;
extern unsigned long hugepages_treat_as_movable;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
extern int sysctl_hugetlb_shm_group;
@@ -258,6 +256,9 @@ static inline unsigned huge_page_shift(s
return h->order + PAGE_SHIFT;
}

+extern unsigned long max_huge_pages[HUGE_MAX_HSTATE];
+extern unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
+
#else
struct hstate {};
#define hstate_file(f) NULL
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -935,7 +935,7 @@ static struct ctl_table vm_table[] = {
{
.procname = "nr_hugepages",
.data = &max_huge_pages,
- .maxlen = sizeof(unsigned long),
+ .maxlen = sizeof(max_huge_pages),
.mode = 0644,
.proc_handler = &hugetlb_sysctl_handler,
.extra1 = (void *)&hugetlb_zero,
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -22,8 +22,8 @@
#include "internal.h"

const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-unsigned long max_huge_pages;
-unsigned long sysctl_overcommit_huge_pages;
+unsigned long max_huge_pages[HUGE_MAX_HSTATE];
+unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;

@@ -496,11 +496,11 @@ static int __init hugetlb_init_hstate(st

h->hugetlb_next_nid = first_node(node_online_map);

- for (i = 0; i < max_huge_pages; ++i) {
+ for (i = 0; i < max_huge_pages[h - hstates]; ++i) {
if (!alloc_fresh_huge_page(h))
break;
}
- max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
+ max_huge_pages[h - hstates] = h->free_huge_pages = h->nr_huge_pages = i;

printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
h->free_huge_pages,
@@ -531,8 +531,9 @@ void __init huge_add_hstate(unsigned ord

static int __init hugetlb_setup(char *s)
{
- if (sscanf(s, "%lu", &max_huge_pages) <= 0)
- max_huge_pages = 0;
+ unsigned long *mhp = &max_huge_pages[parsed_hstate - hstates];
+ if (sscanf(s, "%lu", mhp) <= 0)
+ *mhp = 0;
return 1;
}
__setup("hugepages=", hugetlb_setup);
@@ -584,10 +585,12 @@ static inline void try_to_free_low(unsig
#endif

#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static unsigned long set_max_huge_pages(unsigned long count)
+static unsigned long
+set_max_huge_pages(struct hstate *h, unsigned long count, int *err)
{
unsigned long min_count, ret;
- struct hstate *h = &global_hstate;
+
+ *err = 0;

/*
* Increase the pool size
@@ -659,8 +662,20 @@ int hugetlb_sysctl_handler(struct ctl_ta
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
- proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
- max_huge_pages = set_max_huge_pages(max_huge_pages);
+ int err = 0;
+ struct hstate *h;
+ int i;
+ err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
+ if (err)
+ return err;
+ i = 0;
+ for_each_hstate (h) {
+ max_huge_pages[i] = set_max_huge_pages(h, max_huge_pages[i],
+ &err);
+ if (err)
+ return err;
+ i++;
+ }
return 0;
}

@@ -680,10 +695,14 @@ int hugetlb_overcommit_handler(struct ct
struct file *file, void __user *buffer,
size_t *length, loff_t *ppos)
{
- struct hstate *h = &global_hstate;
+ struct hstate *h;
+ int i = 0;
proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
spin_lock(&hugetlb_lock);
- h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
+ for_each_hstate (h) {
+ h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages[i];
+ i++;
+ }
spin_unlock(&hugetlb_lock);
return 0;
}

2008-03-17 01:58:59

by Andi Kleen

Subject: [PATCH] [3/18] Convert /proc output code over to report multiple hstates


I chose to just report the numbers in a row, in the hope
of minimizing breakage of existing software. The "compat" page size
is always the first number.
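
For illustration, with the default 2MB hstate plus an additional 1GB hstate
the /proc/meminfo section would look roughly like this (the counts are made
up; the first column is always the compat page size):

HugePages_Total:   512     4
HugePages_Free:    512     4
HugePages_Rsvd:      0     0
HugePages_Surp:      0     0
Hugepagesize:     2048 1048576 kB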

Signed-off-by: Andi Kleen <[email protected]>

---
mm/hugetlb.c | 59 +++++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 39 insertions(+), 20 deletions(-)

Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -683,37 +683,56 @@ int hugetlb_overcommit_handler(struct ct

#endif /* CONFIG_SYSCTL */

+static int dump_field(char *buf, unsigned field)
+{
+ int n = 0;
+ struct hstate *h;
+ for_each_hstate (h)
+ n += sprintf(buf + n, " %5lu", *(unsigned long *)((char *)h + field));
+ buf[n++] = '\n';
+ return n;
+}
+
int hugetlb_report_meminfo(char *buf)
{
- struct hstate *h = &global_hstate;
- return sprintf(buf,
- "HugePages_Total: %5lu\n"
- "HugePages_Free: %5lu\n"
- "HugePages_Rsvd: %5lu\n"
- "HugePages_Surp: %5lu\n"
- "Hugepagesize: %5lu kB\n",
- h->nr_huge_pages,
- h->free_huge_pages,
- h->resv_huge_pages,
- h->surplus_huge_pages,
- 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
+ struct hstate *h;
+ int n = 0;
+ n += sprintf(buf + 0, "HugePages_Total:");
+ n += dump_field(buf + n, offsetof(struct hstate, nr_huge_pages));
+ n += sprintf(buf + n, "HugePages_Free: ");
+ n += dump_field(buf + n, offsetof(struct hstate, free_huge_pages));
+ n += sprintf(buf + n, "HugePages_Rsvd: ");
+ n += dump_field(buf + n, offsetof(struct hstate, resv_huge_pages));
+ n += sprintf(buf + n, "HugePages_Surp: ");
+ n += dump_field(buf + n, offsetof(struct hstate, surplus_huge_pages));
+ n += sprintf(buf + n, "Hugepagesize: ");
+ for_each_hstate (h)
+ n += sprintf(buf + n, " %5u", huge_page_size(h) / 1024);
+ n += sprintf(buf + n, " kB\n");
+ return n;
}

int hugetlb_report_node_meminfo(int nid, char *buf)
{
- struct hstate *h = &global_hstate;
- return sprintf(buf,
- "Node %d HugePages_Total: %5u\n"
- "Node %d HugePages_Free: %5u\n",
- nid, h->nr_huge_pages_node[nid],
- nid, h->free_huge_pages_node[nid]);
+ int n = 0;
+ n += sprintf(buf, "Node %d HugePages_Total:", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ nr_huge_pages_node[nid]));
+ n += sprintf(buf + n , "Node %d HugePages_Free: ", nid);
+ n += dump_field(buf + n, offsetof(struct hstate,
+ free_huge_pages_node[nid]));
+ return n;
}

/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
- struct hstate *h = &global_hstate;
- return h->nr_huge_pages * (1 << huge_page_order(h));
+ long x = 0;
+ struct hstate *h;
+ for_each_hstate (h) {
+ x += h->nr_huge_pages * (1 << huge_page_order(h));
+ }
+ return x;
}

/*

2008-03-17 01:59:51

by Andi Kleen

Subject: [PATCH] [4/18] Add basic support for more than one hstate in hugetlbfs


Signed-off-by: Andi Kleen <[email protected]>

---
mm/hugetlb.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -550,26 +550,33 @@ static unsigned int cpuset_mems_nr(unsig

#ifdef CONFIG_SYSCTL
#ifdef CONFIG_HIGHMEM
-static void try_to_free_low(unsigned long count)
+static void do_try_to_free_low(struct hstate *h, unsigned long count)
{
- struct hstate *h = &global_hstate;
int i;

for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
list_for_each_entry_safe(page, next, freel, lru) {
- if (count >= nr_huge_pages)
+ if (count >= h->nr_huge_pages)
return;
if (PageHighMem(page))
continue;
list_del(&page->lru);
- update_and_free_page(page);
+ update_and_free_page(h, page);
h->free_huge_pages--;
h->free_huge_pages_node[page_to_nid(page)]--;
}
}
}
+
+static void try_to_free_low(unsigned long count)
+{
+ struct hstate *h;
+ for_each_hstate (h) {
+ do_try_to_free_low(h, count);
+ }
+}
#else
static inline void try_to_free_low(unsigned long count)
{

2008-03-17 02:00:19

by Andi Kleen

Subject: [PATCH] [6/18] Add support to have individual hstates for each hugetlbfs mount


- Add a new pagesize= option to the hugetlbfs mount that allows setting
the page size
- Set up pointers to the hstate selected by the page size option
in the super block, the inode, and the vma.
- Change the hstate accessors to use this information
- Add code to the hstate init function to set parsed_hstate for command
line processing
- Handle duplicate hstate registrations to make the command line user proof

Signed-off-by: Andi Kleen <[email protected]>

---
fs/hugetlbfs/inode.c | 50 ++++++++++++++++++++++++++++++++++++++----------
include/linux/hugetlb.h | 12 ++++++++---
mm/hugetlb.c | 22 +++++++++++++++++----
3 files changed, 67 insertions(+), 17 deletions(-)

Index: linux/include/linux/hugetlb.h
===================================================================
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -134,6 +134,7 @@ struct hugetlbfs_config {
umode_t mode;
long nr_blocks;
long nr_inodes;
+ struct hstate *hstate;
};

struct hugetlbfs_sb_info {
@@ -142,12 +143,14 @@ struct hugetlbfs_sb_info {
long max_inodes; /* inodes allowed */
long free_inodes; /* inodes free */
spinlock_t stat_lock;
+ struct hstate *hstate;
};


struct hugetlbfs_inode_info {
struct shared_policy policy;
struct inode vfs_inode;
+ struct hstate *hstate;
};

static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
@@ -212,6 +215,7 @@ struct hstate {
};

void __init huge_add_hstate(unsigned order);
+struct hstate *huge_lookup_hstate(unsigned long pagesize);

#ifndef HUGE_MAX_HSTATE
#define HUGE_MAX_HSTATE 1
@@ -223,17 +227,19 @@ extern struct hstate hstates[HUGE_MAX_HS

static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{
- return &global_hstate;
+ return (struct hstate *)vma->vm_private_data;
}

static inline struct hstate *hstate_file(struct file *f)
{
- return &global_hstate;
+ struct dentry *d = f->f_dentry;
+ struct inode *i = d->d_inode;
+ return HUGETLBFS_I(i)->hstate;
}

static inline struct hstate *hstate_inode(struct inode *i)
{
- return &global_hstate;
+ return HUGETLBFS_I(i)->hstate;
}

static inline unsigned huge_page_size(struct hstate *h)
Index: linux/fs/hugetlbfs/inode.c
===================================================================
--- linux.orig/fs/hugetlbfs/inode.c
+++ linux/fs/hugetlbfs/inode.c
@@ -53,6 +53,7 @@ int sysctl_hugetlb_shm_group;
enum {
Opt_size, Opt_nr_inodes,
Opt_mode, Opt_uid, Opt_gid,
+ Opt_pagesize,
Opt_err,
};

@@ -62,6 +63,7 @@ static match_table_t tokens = {
{Opt_mode, "mode=%o"},
{Opt_uid, "uid=%u"},
{Opt_gid, "gid=%u"},
+ {Opt_pagesize, "pagesize=%s"},
{Opt_err, NULL},
};

@@ -92,6 +94,7 @@ static int hugetlbfs_file_mmap(struct fi
*/
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
+ vma->vm_private_data = h;

if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
return -EINVAL;
@@ -530,6 +533,7 @@ static struct inode *hugetlbfs_get_inode
inode->i_op = &page_symlink_inode_operations;
break;
}
+ info->hstate = HUGETLBFS_SB(sb)->hstate;
}
return inode;
}
@@ -750,6 +754,8 @@ hugetlbfs_parse_options(char *options, s
char *p, *rest;
substring_t args[MAX_OPT_ARGS];
int option;
+ unsigned long long size = 0;
+ enum { NO_SIZE, SIZE_STD, SIZE_PERCENT } setsize = NO_SIZE;

if (!options)
return 0;
@@ -780,17 +786,13 @@ hugetlbfs_parse_options(char *options, s
break;

case Opt_size: {
- unsigned long long size;
/* memparse() will accept a K/M/G without a digit */
if (!isdigit(*args[0].from))
goto bad_val;
size = memparse(args[0].from, &rest);
- if (*rest == '%') {
- size <<= HPAGE_SHIFT;
- size *= max_huge_pages;
- do_div(size, 100);
- }
- pconfig->nr_blocks = (size >> HPAGE_SHIFT);
+ setsize = SIZE_STD;
+ if (*rest == '%')
+ setsize = SIZE_PERCENT;
break;
}

@@ -801,6 +803,19 @@ hugetlbfs_parse_options(char *options, s
pconfig->nr_inodes = memparse(args[0].from, &rest);
break;

+ case Opt_pagesize: {
+ unsigned long ps;
+ ps = memparse(args[0].from, &rest);
+ pconfig->hstate = huge_lookup_hstate(ps);
+ if (!pconfig->hstate) {
+ printk(KERN_ERR
+ "hugetlbfs: Unsupported page size %lu MB\n",
+ ps >> 20);
+ return -EINVAL;
+ }
+ break;
+ }
+
default:
printk(KERN_ERR "hugetlbfs: Bad mount option: \"%s\"\n",
p);
@@ -808,6 +823,18 @@ hugetlbfs_parse_options(char *options, s
break;
}
}
+
+ /* Do size after hstate is set up */
+ if (setsize > NO_SIZE) {
+ struct hstate *h = pconfig->hstate;
+ if (setsize == SIZE_PERCENT) {
+ size <<= huge_page_shift(h);
+ size *= max_huge_pages[h - hstates];
+ do_div(size, 100);
+ }
+ pconfig->nr_blocks = (size >> huge_page_shift(h));
+ }
+
return 0;

bad_val:
@@ -832,6 +859,7 @@ hugetlbfs_fill_super(struct super_block
config.uid = current->fsuid;
config.gid = current->fsgid;
config.mode = 0755;
+ config.hstate = &global_hstate;
ret = hugetlbfs_parse_options(data, &config);
if (ret)
return ret;
@@ -840,14 +868,15 @@ hugetlbfs_fill_super(struct super_block
if (!sbinfo)
return -ENOMEM;
sb->s_fs_info = sbinfo;
+ sbinfo->hstate = config.hstate;
spin_lock_init(&sbinfo->stat_lock);
sbinfo->max_blocks = config.nr_blocks;
sbinfo->free_blocks = config.nr_blocks;
sbinfo->max_inodes = config.nr_inodes;
sbinfo->free_inodes = config.nr_inodes;
sb->s_maxbytes = MAX_LFS_FILESIZE;
- sb->s_blocksize = HPAGE_SIZE;
- sb->s_blocksize_bits = HPAGE_SHIFT;
+ sb->s_blocksize = huge_page_size(config.hstate);
+ sb->s_blocksize_bits = huge_page_shift(config.hstate);
sb->s_magic = HUGETLBFS_MAGIC;
sb->s_op = &hugetlbfs_ops;
sb->s_time_gran = 1;
@@ -949,7 +978,8 @@ struct file *hugetlb_file_setup(const ch
goto out_dentry;

error = -ENOMEM;
- if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT))
+ if (hugetlb_reserve_pages(inode, 0,
+ size >> huge_page_shift(hstate_inode(inode))))
goto out_inode;

d_instantiate(dentry, inode);
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -143,7 +143,7 @@ static void update_and_free_page(struct

static void free_huge_page(struct page *page)
{
- struct hstate *h = &global_hstate;
+ struct hstate *h = huge_lookup_hstate(PAGE_SIZE << compound_order(page));
int nid = page_to_nid(page);
struct address_space *mapping;

@@ -519,7 +519,11 @@ module_init(hugetlb_init);
/* Should be called on processing a hugepagesz=... option */
void __init huge_add_hstate(unsigned order)
{
- struct hstate *h;
+ struct hstate *h = huge_lookup_hstate(PAGE_SIZE << order);
+ if (h) {
+ parsed_hstate = h;
+ return;
+ }
BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
BUG_ON(order <= HPAGE_SHIFT - PAGE_SHIFT);
h = &hstates[max_hstate++];
@@ -538,6 +542,16 @@ static int __init hugetlb_setup(char *s)
}
__setup("hugepages=", hugetlb_setup);

+struct hstate *huge_lookup_hstate(unsigned long pagesize)
+{
+ struct hstate *h;
+ for_each_hstate (h) {
+ if (huge_page_size(h) == pagesize)
+ return h;
+ }
+ return NULL;
+}
+
static unsigned int cpuset_mems_nr(unsigned int *array)
{
int node;
@@ -1345,7 +1359,7 @@ out:
int hugetlb_reserve_pages(struct inode *inode, long from, long to)
{
long ret, chg;
- struct hstate *h = &global_hstate;
+ struct hstate *h = hstate_inode(inode);

chg = region_chg(&inode->i_mapping->private_list, from, to);
if (chg < 0)
@@ -1364,7 +1378,7 @@ int hugetlb_reserve_pages(struct inode *

void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
{
- struct hstate *h = &global_hstate;
+ struct hstate *h = hstate_inode(inode);
long chg = region_truncate(&inode->i_mapping->private_list, offset);

spin_lock(&inode->i_lock);

2008-03-17 02:01:16

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [8/18] Add a __alloc_bootmem_node_nopanic


Straight forward variant of the existing __alloc_bootmem_node; the only
difference is that it doesn't panic on failure.

Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/bootmem.h | 4 ++++
mm/bootmem.c | 12 ++++++++++++
2 files changed, 16 insertions(+)

Index: linux/mm/bootmem.c
===================================================================
--- linux.orig/mm/bootmem.c
+++ linux/mm/bootmem.c
@@ -471,6 +471,18 @@ void * __init __alloc_bootmem_node(pg_da
return __alloc_bootmem(size, align, goal);
}

+void * __init __alloc_bootmem_node_nopanic(pg_data_t *pgdat, unsigned long size,
+ unsigned long align, unsigned long goal)
+{
+ void *ptr;
+
+ ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, 0);
+ if (ptr)
+ return ptr;
+
+ return __alloc_bootmem_nopanic(size, align, goal);
+}
+
#ifndef ARCH_LOW_ADDRESS_LIMIT
#define ARCH_LOW_ADDRESS_LIMIT 0xffffffffUL
#endif
Index: linux/include/linux/bootmem.h
===================================================================
--- linux.orig/include/linux/bootmem.h
+++ linux/include/linux/bootmem.h
@@ -90,6 +90,10 @@ extern void *__alloc_bootmem_node(pg_dat
unsigned long size,
unsigned long align,
unsigned long goal);
+extern void *__alloc_bootmem_node_nopanic(pg_data_t *pgdat,
+ unsigned long size,
+ unsigned long align,
+ unsigned long goal);
extern unsigned long init_bootmem_node(pg_data_t *pgdat,
unsigned long freepfn,
unsigned long startpfn,
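
For reference, a minimal caller sketch (not part of the posted patch; nid,
size and align are placeholders) showing the intended difference from
__alloc_bootmem_node, namely that the caller handles failure instead of the
allocator panicking:

	void *ptr = __alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align, 0);
	if (!ptr)
		return 0;	/* degrade gracefully instead of panicking */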

2008-03-17 02:00:54

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [7/18] Abstract out the NUMA node round robin code into a separate function


Need this as a separate function for a future patch.

No behaviour change.

Signed-off-by: Andi Kleen <[email protected]>

---
mm/hugetlb.c | 37 ++++++++++++++++++++++---------------
1 file changed, 22 insertions(+), 15 deletions(-)

Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -219,6 +219,27 @@ static struct page *alloc_fresh_huge_pag
return page;
}

+/*
+ * Use a helper variable to find the next node and then
+ * copy it back to hugetlb_next_nid afterwards:
+ * otherwise there's a window in which a racer might
+ * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * But we don't need to use a spin_lock here: it really
+ * doesn't matter if occasionally a racer chooses the
+ * same nid as we do. Move nid forward in the mask even
+ * if we just successfully allocated a hugepage so that
+ * the next caller gets hugepages on the next node.
+ */
+static int huge_next_node(struct hstate *h)
+{
+ int next_nid;
+ next_nid = next_node(h->hugetlb_next_nid, node_online_map);
+ if (next_nid == MAX_NUMNODES)
+ next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = next_nid;
+ return next_nid;
+}
+
static int alloc_fresh_huge_page(struct hstate *h)
{
struct page *page;
@@ -232,21 +253,7 @@ static int alloc_fresh_huge_page(struct
page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
if (page)
ret = 1;
- /*
- * Use a helper variable to find the next node and then
- * copy it back to hugetlb_next_nid afterwards:
- * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
- * But we don't need to use a spin_lock here: it really
- * doesn't matter if occasionally a racer chooses the
- * same nid as we do. Move nid forward in the mask even
- * if we just successfully allocated a hugepage so that
- * the next caller gets hugepages on the next node.
- */
- next_nid = next_node(h->hugetlb_next_nid, node_online_map);
- if (next_nid == MAX_NUMNODES)
- next_nid = first_node(node_online_map);
- h->hugetlb_next_nid = next_nid;
+ next_nid = huge_next_node(h);
} while (!page && h->hugetlb_next_nid != start_nid);

return ret;

2008-03-17 02:01:33

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [9/18] Export prep_compound_page to the hugetlb allocator


hugetlb will need to get compound pages from bootmem to handle
the case of them being larger than MAX_ORDER. Export
the constructor function needed for this.

Signed-off-by: Andi Kleen <[email protected]>

---
mm/internal.h | 2 ++
mm/page_alloc.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -13,6 +13,8 @@

#include <linux/mm.h>

+extern void prep_compound_page(struct page *page, unsigned long order);
+
static inline void set_page_count(struct page *page, int v)
{
atomic_set(&page->_count, v);
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -272,7 +272,7 @@ static void free_compound_page(struct pa
__free_pages_ok(page, compound_order(page));
}

-static void prep_compound_page(struct page *page, unsigned long order)
+void prep_compound_page(struct page *page, unsigned long order)
{
int i;
int nr_pages = 1 << order;

2008-03-17 02:01:47

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [10/18] Factor out new huge page preparation code into separate function


Needed to avoid code duplication in follow up patches.

This happens to fix a minor bug. When alloc_bootmem_node falls back and
returns memory on a different node than the one requested, the old code
would have put it into the free lists of the wrong node.
Now it ends up in the freelist of the correct node.

Signed-off-by: Andi Kleen <[email protected]>

---
mm/hugetlb.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)

Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -200,6 +200,17 @@ static int adjust_pool_surplus(struct hs
return ret;
}

+static void huge_new_page(struct hstate *h, struct page *page)
+{
+ unsigned nid = pfn_to_nid(page_to_pfn(page));
+ set_compound_page_dtor(page, free_huge_page);
+ spin_lock(&hugetlb_lock);
+ h->nr_huge_pages++;
+ h->nr_huge_pages_node[nid]++;
+ spin_unlock(&hugetlb_lock);
+ put_page(page); /* free it into the hugepage allocator */
+}
+
static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;
@@ -207,14 +218,8 @@ static struct page *alloc_fresh_huge_pag
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
huge_page_order(h));
- if (page) {
- set_compound_page_dtor(page, free_huge_page);
- spin_lock(&hugetlb_lock);
- h->nr_huge_pages++;
- h->nr_huge_pages_node[nid]++;
- spin_unlock(&hugetlb_lock);
- put_page(page); /* free it into the hugepage allocator */
- }
+ if (page)
+ huge_new_page(h, page);

return page;
}

2008-03-17 02:02:03

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [11/18] Fix alignment bug in bootmem allocator


Without this fix bootmem can return unaligned addresses when the start of a
node is not aligned to the align value. Needed for reliably allocating
gigabyte pages.
Signed-off-by: Andi Kleen <[email protected]>

---
mm/bootmem.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/mm/bootmem.c
===================================================================
--- linux.orig/mm/bootmem.c
+++ linux/mm/bootmem.c
@@ -197,6 +197,7 @@ __alloc_bootmem_core(struct bootmem_data
{
unsigned long offset, remaining_size, areasize, preferred;
unsigned long i, start = 0, incr, eidx, end_pfn;
+ unsigned long pfn;
void *ret;

if (!size) {
@@ -239,12 +240,13 @@ __alloc_bootmem_core(struct bootmem_data
preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
incr = align >> PAGE_SHIFT ? : 1;
+ pfn = PFN_DOWN(bdata->node_boot_start);

restart_scan:
for (i = preferred; i < eidx; i += incr) {
unsigned long j;
i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
- i = ALIGN(i, incr);
+ i = ALIGN(pfn + i, incr) - pfn;
if (i >= eidx)
break;
if (test_bit(i, bdata->node_bootmem_map))

2008-03-17 02:02:27

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [12/18] Add support to allocate hugetlb pages that are larger than MAX_ORDER


This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
not practical to enlarge MAX_ORDER to 1GB.

Instead the 1GB pages are only allocated at boot using the bootmem
allocator using the hugepages=... option.

These 1G bootmem pages are never freed. In theory it would be possible
to implement that with some complications, but since it would be a one-way
street (> MAX_ORDER pages cannot be allocated later) I decided not to do
that for now.

The > MAX_ORDER code is not ifdef'ed per architecture. It is not very big
and the ifdef ugliness did not seem worth it.

Known problems: /proc/meminfo and "free" do not display the memory
allocated for GB pages in "Total". This is a little confusing for the
user.
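
For illustration, with the rest of this series applied the intended usage is
along these lines (option and mount syntax as introduced elsewhere in the
series; the numbers are made up):

	kernel command line:  hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=512
	one mount per size:   mount -t hugetlbfs -o pagesize=1G none /huge1g
	                      mount -t hugetlbfs -o pagesize=2M none /huge2m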

Signed-off-by: Andi Kleen <[email protected]>

---
mm/hugetlb.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 62 insertions(+), 2 deletions(-)

Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/bootmem.h>

#include <asm/page.h>
#include <asm/pgtable.h>
@@ -153,7 +154,7 @@ static void free_huge_page(struct page *
INIT_LIST_HEAD(&page->lru);

spin_lock(&hugetlb_lock);
- if (h->surplus_huge_pages_node[nid]) {
+ if (h->surplus_huge_pages_node[nid] && h->order <= MAX_ORDER) {
update_and_free_page(h, page);
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
@@ -215,6 +216,9 @@ static struct page *alloc_fresh_huge_pag
{
struct page *page;

+ if (h->order > MAX_ORDER)
+ return NULL;
+
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
huge_page_order(h));
@@ -271,6 +275,9 @@ static struct page *alloc_buddy_huge_pag
struct page *page;
unsigned int nid;

+ if (h->order > MAX_ORDER)
+ return NULL;
+
/*
* Assume we will successfully allocate the surplus page to
* prevent racing processes from causing the surplus to exceed
@@ -422,6 +429,10 @@ return_unused_surplus_pages(struct hstat
/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;

+ /* Cannot return gigantic pages currently */
+ if (h->order > MAX_ORDER)
+ return;
+
nr_pages = min(unused_resv_pages, h->surplus_huge_pages);

while (nr_pages) {
@@ -499,6 +510,44 @@ static struct page *alloc_huge_page(stru
return page;
}

+static __initdata LIST_HEAD(huge_boot_pages);
+
+struct huge_bm_page {
+ struct list_head list;
+ struct hstate *hstate;
+};
+
+static int __init alloc_bm_huge_page(struct hstate *h)
+{
+ struct huge_bm_page *m;
+ m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
+ huge_page_size(h), huge_page_size(h),
+ 0);
+ if (!m)
+ return 0;
+ BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
+ /* Put them into a private list first because mem_map is not up yet */
+ list_add(&m->list, &huge_boot_pages);
+ m->hstate = h;
+ huge_next_node(h);
+ return 1;
+}
+
+/* Put bootmem huge pages into the standard lists after mem_map is up */
+static int __init huge_init_bm(void)
+{
+ struct huge_bm_page *m;
+ list_for_each_entry (m, &huge_boot_pages, list) {
+ struct page *page = virt_to_page(m);
+ struct hstate *h = m->hstate;
+ __ClearPageReserved(page);
+ prep_compound_page(page, h->order);
+ huge_new_page(h, page);
+ }
+ return 0;
+}
+__initcall(huge_init_bm);
+
static int __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
@@ -509,7 +558,10 @@ static int __init hugetlb_init_hstate(st
h->hugetlb_next_nid = first_node(node_online_map);

for (i = 0; i < max_huge_pages[h - hstates]; ++i) {
- if (!alloc_fresh_huge_page(h))
+ if (h->order > MAX_ORDER) {
+ if (!alloc_bm_huge_page(h))
+ break;
+ } else if (!alloc_fresh_huge_page(h))
break;
}
max_huge_pages[h - hstates] = h->free_huge_pages = h->nr_huge_pages = i;
@@ -581,6 +633,9 @@ static void do_try_to_free_low(struct hs
{
int i;

+ if (h->order > MAX_ORDER)
+ return;
+
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
@@ -618,6 +673,11 @@ set_max_huge_pages(struct hstate *h, uns

*err = 0;

+ if (h->order > MAX_ORDER) {
+ *err = -EINVAL;
+ return max_huge_pages[h - hstates];
+ }
+
/*
* Increase the pool size
* First take pages out of surplus state. Then make up the

2008-03-17 02:02:57

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [14/18] Clean up hugetlb boot time printk


- Reword sentence to clarify meaning with multiple options
- Add support for using GB prefixes for the page size
- Add extra printk to delayed > MAX_ORDER allocation code

Signed-off-by: Andi Kleen <[email protected]>

---
mm/hugetlb.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)

Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -510,6 +510,15 @@ static struct page *alloc_huge_page(stru
return page;
}

+static __init char *memfmt(char *buf, unsigned long n)
+{
+ if (n >= (1UL << 30))
+ sprintf(buf, "%lu GB", n >> 30);
+ else
+ sprintf(buf, "%lu MB", n >> 20);
+ return buf;
+}
+
static __initdata LIST_HEAD(huge_boot_pages);

struct huge_bm_page {
@@ -536,14 +545,28 @@ static int __init alloc_bm_huge_page(str
/* Put bootmem huge pages into the standard lists after mem_map is up */
static int __init huge_init_bm(void)
{
+ unsigned long pages = 0;
struct huge_bm_page *m;
+ struct hstate *h = NULL;
+ char buf[32];
+
list_for_each_entry (m, &huge_boot_pages, list) {
struct page *page = virt_to_page(m);
- struct hstate *h = m->hstate;
+ h = m->hstate;
__ClearPageReserved(page);
prep_compound_page(page, h->order);
huge_new_page(h, page);
+ pages++;
}
+
+ /*
+ * This only prints for a single hstate. This works for x86-64,
+ * but if you do multiple > MAX_ORDER hstates you'll need to fix it.
+ */
+ if (pages > 0)
+ printk(KERN_INFO "HugeTLB pre-allocated %ld %s pages\n",
+ h->free_huge_pages,
+ memfmt(buf, huge_page_size(h)));
return 0;
}
__initcall(huge_init_bm);
@@ -551,6 +574,8 @@ __initcall(huge_init_bm);
static int __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
+ char buf[32];
+ unsigned long pages = 0;

/* Don't reinitialize lists if they have been already init'ed */
if (!h->hugepage_freelists[0].next) {
@@ -567,12 +592,14 @@ static int __init hugetlb_init_hstate(st
} else if (!alloc_fresh_huge_page(h))
break;
h->parsed_hugepages++;
+ pages++;
}
max_huge_pages[h - hstates] = h->parsed_hugepages;

- printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
+ if (pages > 0)
+ printk(KERN_INFO "HugeTLB pre-allocated %ld %s pages\n",
h->free_huge_pages,
- 1 << (h->order + PAGE_SHIFT - 20));
+ memfmt(buf, huge_page_size(h)));
return 0;
}

2008-03-17 02:03:21

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [15/18] Add support to x86-64 to allocate and lookup GB pages in hugetlb


Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/hugetlbpage.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/x86/mm/hugetlbpage.c
+++ linux/arch/x86/mm/hugetlbpage.c
@@ -133,9 +133,14 @@ pte_t *huge_pte_alloc(struct mm_struct *
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (pud) {
- if (pud_none(*pud))
- huge_pmd_share(mm, addr, pud);
- pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ if (sz == PUD_SIZE) {
+ pte = (pte_t *)pud;
+ } else {
+ BUG_ON(sz != PMD_SIZE);
+ if (pud_none(*pud))
+ huge_pmd_share(mm, addr, pud);
+ pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ }
}
BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));

@@ -151,8 +156,11 @@ pte_t *huge_pte_offset(struct mm_struct
pgd = pgd_offset(mm, addr);
if (pgd_present(*pgd)) {
pud = pud_offset(pgd, addr);
- if (pud_present(*pud))
+ if (pud_present(*pud)) {
+ if (pud_large(*pud))
+ return (pte_t *)pud;
pmd = pmd_offset(pud, addr);
+ }
}
return (pte_t *) pmd;
}

2008-03-17 02:02:42

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [13/18] Add support to allocate hugepages of different size with hugepages=...


Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 23 ++++++++++++++++++-----
2 files changed, 19 insertions(+), 5 deletions(-)

Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -552,19 +552,23 @@ static int __init hugetlb_init_hstate(st
{
unsigned long i;

- for (i = 0; i < MAX_NUMNODES; ++i)
- INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+ /* Don't reinitialize lists if they have been already init'ed */
+ if (!h->hugepage_freelists[0].next) {
+ for (i = 0; i < MAX_NUMNODES; ++i)
+ INIT_LIST_HEAD(&h->hugepage_freelists[i]);

- h->hugetlb_next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = first_node(node_online_map);
+ }

- for (i = 0; i < max_huge_pages[h - hstates]; ++i) {
+ while (h->parsed_hugepages < max_huge_pages[h - hstates]) {
if (h->order > MAX_ORDER) {
if (!alloc_bm_huge_page(h))
break;
} else if (!alloc_fresh_huge_page(h))
break;
+ h->parsed_hugepages++;
}
- max_huge_pages[h - hstates] = h->free_huge_pages = h->nr_huge_pages = i;
+ max_huge_pages[h - hstates] = h->parsed_hugepages;

printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
h->free_huge_pages,
@@ -602,6 +606,15 @@ static int __init hugetlb_setup(char *s)
unsigned long *mhp = &max_huge_pages[parsed_hstate - hstates];
if (sscanf(s, "%lu", mhp) <= 0)
*mhp = 0;
+ /*
+ * Global state is always initialized later in hugetlb_init.
+ * But we need to allocate > MAX_ORDER hstates here early to still
+ * use the bootmem allocator.
+ * If you add additional hstates <= MAX_ORDER you'll need
+ * to fix that.
+ */
+ if (parsed_hstate != &global_hstate)
+ hugetlb_init_hstate(parsed_hstate);
return 1;
}
__setup("hugepages=", hugetlb_setup);
Index: linux/include/linux/hugetlb.h
===================================================================
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -212,6 +212,7 @@ struct hstate {
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+ unsigned long parsed_hugepages;
};

void __init huge_add_hstate(unsigned order);

2008-03-17 02:03:38

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [16/18] Add huge pud support to hugetlbfs


Straight forward extensions for huge pages located in the PUD
instead of PMDs.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/ia64/mm/hugetlbpage.c | 6 ++++++
arch/powerpc/mm/hugetlbpage.c | 5 +++++
arch/sh/mm/hugetlbpage.c | 5 +++++
arch/sparc64/mm/hugetlbpage.c | 5 +++++
arch/x86/mm/hugetlbpage.c | 25 ++++++++++++++++++++++++-
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 9 +++++++++
7 files changed, 59 insertions(+), 1 deletion(-)

Index: linux/include/linux/hugetlb.h
===================================================================
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -45,7 +45,10 @@ struct page *follow_huge_addr(struct mm_
int write);
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write);
+struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write);
int pmd_huge(pmd_t pmd);
+int pud_huge(pud_t pud);
void hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);

@@ -112,8 +115,10 @@ static inline unsigned long hugetlb_tota
#define hugetlb_report_meminfo(buf) 0
#define hugetlb_report_node_meminfo(n, buf) 0
#define follow_huge_pmd(mm, addr, pmd, write) NULL
+#define follow_huge_pud(mm, addr, pud, write) NULL
#define prepare_hugepage_range(addr,len) (-EINVAL)
#define pmd_huge(x) 0
+#define pud_huge(x) 0
#define is_hugepage_only_range(mm, addr, len) 0
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
#define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; })
Index: linux/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/ia64/mm/hugetlbpage.c
+++ linux/arch/ia64/mm/hugetlbpage.c
@@ -106,6 +106,12 @@ int pmd_huge(pmd_t pmd)
{
return 0;
}
+
+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write)
{
Index: linux/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/powerpc/mm/hugetlbpage.c
+++ linux/arch/powerpc/mm/hugetlbpage.c
@@ -368,6 +368,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}

+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
Index: linux/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/sh/mm/hugetlbpage.c
+++ linux/arch/sh/mm/hugetlbpage.c
@@ -78,6 +78,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}

+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
{
Index: linux/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/sparc64/mm/hugetlbpage.c
+++ linux/arch/sparc64/mm/hugetlbpage.c
@@ -294,6 +294,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}

+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
{
Index: linux/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/x86/mm/hugetlbpage.c
+++ linux/arch/x86/mm/hugetlbpage.c
@@ -196,6 +196,11 @@ int pmd_huge(pmd_t pmd)
return 0;
}

+int pud_huge(pud_t pud)
+{
+ return 0;
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
@@ -216,6 +221,11 @@ int pmd_huge(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PSE);
}

+int pud_huge(pud_t pud)
+{
+ return !!(pud_val(pud) & _PAGE_PSE);
+}
+
struct page *
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write)
@@ -224,9 +234,22 @@ follow_huge_pmd(struct mm_struct *mm, un

page = pte_page(*(pte_t *)pmd);
if (page)
- page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
+ page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
+ return page;
+}
+
+struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write)
+{
+ struct page *page;
+
+ page = pte_page(*(pte_t *)pud);
+ if (page)
+ page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
return page;
}
+
#endif

/* x86_64 also uses this file */
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -1206,6 +1206,15 @@ int hugetlb_fault(struct mm_struct *mm,
return ret;
}

+/* Can be overridden by architectures */
+__attribute__((weak)) struct page *
+follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write)
+{
+ BUG();
+ return NULL;
+}
+
int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, int *length, int i,

2008-03-17 02:03:53

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [17/18] Add huge pud support to mm/memory.c


mm/memory.c seems to have already gained some knowledge about huge pages:
in particular in get_user_pages. Fix that code up to support huge
puds.

Signed-off-by: Andi Kleen <[email protected]>

---
mm/memory.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -931,7 +931,13 @@ struct page *follow_page(struct vm_area_
pud = pud_offset(pgd, address);
if (pud_none(*pud) || unlikely(pud_bad(*pud)))
goto no_page_table;
-
+
+ if (pud_huge(*pud)) {
+ BUG_ON(flags & FOLL_GET);
+ page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
+ goto out;
+ }
+
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
goto no_page_table;
@@ -1422,6 +1428,8 @@ static int apply_to_pmd_range(struct mm_
unsigned long next;
int err;

+ BUG_ON(pud_huge(*pud));
+
pmd = pmd_alloc(mm, pud, addr);
if (!pmd)
return -ENOMEM;

2008-03-17 02:04:22

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [18/18] Implement hugepagesz= option for x86-64


Add a hugepagesz=... option to x86-64, similar to IA64, PPC etc.

This finally allows selecting GB pages for hugetlbfs on x86 now
that all the infrastructure is in place.

Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/kernel-parameters.txt | 11 +++++++++--
arch/x86/mm/hugetlbpage.c | 17 +++++++++++++++++
include/asm-x86/page.h | 2 ++
3 files changed, 28 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/x86/mm/hugetlbpage.c
+++ linux/arch/x86/mm/hugetlbpage.c
@@ -421,3 +421,20 @@ hugetlb_get_unmapped_area(struct file *f

#endif /*HAVE_ARCH_HUGETLB_UNMAPPED_AREA*/

+#ifdef CONFIG_X86_64
+static __init int setup_hugepagesz(char *opt)
+{
+ unsigned long ps = memparse(opt, &opt);
+ if (ps == PMD_SIZE) {
+ huge_add_hstate(PMD_SHIFT - PAGE_SHIFT);
+ } else if (ps == PUD_SIZE && cpu_has_gbpages) {
+ huge_add_hstate(PUD_SHIFT - PAGE_SHIFT);
+ } else {
+ printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
+ ps >> 20);
+ return 0;
+ }
+ return 1;
+}
+__setup("hugepagesz=", setup_hugepagesz);
+#endif
Index: linux/include/asm-x86/page.h
===================================================================
--- linux.orig/include/asm-x86/page.h
+++ linux/include/asm-x86/page.h
@@ -21,6 +21,8 @@
#define HPAGE_MASK (~(HPAGE_SIZE - 1))
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)

+#define HUGE_MAX_HSTATE 2
+
/* to align the pointer to the (next) page boundary */
#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)

Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -726,8 +726,15 @@ and is between 256 and 4096 characters.
hisax= [HW,ISDN]
See Documentation/isdn/README.HiSax.

- hugepages= [HW,X86-32,IA-64] Maximal number of HugeTLB pages.
- hugepagesz= [HW,IA-64,PPC] The size of the HugeTLB pages.
+ hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
+ hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
+ On x86 this option can be specified multiple times
+ interleaved with hugepages= to reserve huge pages
+ of different sizes. Valid page sizes on x86-64
+ are 2M (when the CPU supports "pse") and 1G (when the
+ CPU supports the "pdpe1gb" cpuinfo flag)
+ Note that 1GB pages can only be allocated at boot time
+ using hugepages= and not freed afterwards.

i8042.direct [HW] Put keyboard port into non-translated mode
i8042.dumbkbd [HW] Pretend that controller can only read data from

2008-03-17 02:19:55

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

On Sun, Mar 16, 2008 at 6:58 PM, Andi Kleen <[email protected]> wrote:
>
> Without this fix bootmem can return unaligned addresses when the start of a
> node is not aligned to the align value. Needed for reliably allocating
> gigabyte pages.
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/bootmem.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux/mm/bootmem.c
> ===================================================================
> --- linux.orig/mm/bootmem.c
> +++ linux/mm/bootmem.c
> @@ -197,6 +197,7 @@ __alloc_bootmem_core(struct bootmem_data
> {
> unsigned long offset, remaining_size, areasize, preferred;
> unsigned long i, start = 0, incr, eidx, end_pfn;
> + unsigned long pfn;
> void *ret;
>
> if (!size) {
> @@ -239,12 +240,13 @@ __alloc_bootmem_core(struct bootmem_data
> preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
> areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
> incr = align >> PAGE_SHIFT ? : 1;
> + pfn = PFN_DOWN(bdata->node_boot_start);
>
> restart_scan:
> for (i = preferred; i < eidx; i += incr) {
> unsigned long j;
> i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
> - i = ALIGN(i, incr);
> + i = ALIGN(pfn + i, incr) - pfn;
> if (i >= eidx)
> break;
> if (test_bit(i, bdata->node_bootmem_map))
> --

node_boot_start is not page aligned?

YH

2008-03-17 03:11:44

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] [0/18] GB pages hugetlb support

Andi,

Are all the "interesting" cpuset related changes in patch:

[PATCH] [1/18] Convert hugeltlb.c over to pass global state around in a structure

?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-03-17 05:35:54

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] [0/18] GB pages hugetlb support

What kernel version is this patchset against ... apparently not 2.6.25-rc5-mm1.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-03-17 06:56:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [0/18] GB pages hugetlb support

On Mon, Mar 17, 2008 at 12:35:22AM -0500, Paul Jackson wrote:
> What kernel version is this patchset against ... apparently not 2.6.25-rc5-mm1.
This was against 2.6.25-rc4

-Andi

2008-03-17 06:58:16

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [0/18] GB pages hugetlb support

On Sun, Mar 16, 2008 at 10:11:32PM -0500, Paul Jackson wrote:
> Andi,
>
> Are all the "interesting" cpuset related changes in patch:
>
> [PATCH] [1/18] Convert hugeltlb.c over to pass global state around in a structure

That one, plus "Add basic support for more than one hstate in hugetlbfs"
and partly "Add support to have individual hstates for each hugetlbfs mount".
It all builds on each other.
Ideally look at the end result of the whole series.

-Andi

2008-03-17 06:59:51

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

> node_boot_start is not page aligned?

It is, but it is not necessarily GB aligned and without this
change sometimes alloc_bootmem when requesting GB alignment
doesn't return GB aligned memory. This was a nasty problem
that took some time to track down.
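
To make that concrete (illustrative numbers): if node_boot_start is at 512MB
(pfn 0x20000) and a 1GB alignment is requested, incr is 0x40000 pfns. The old
ALIGN(i, incr) rounds the node-relative index to a multiple of 0x40000, i.e.
absolute pfns 0x20000 + k*0x40000, which are only 512MB aligned. Aligning
pfn + i instead gives absolute pfns that are multiples of 0x40000, which
really are 1GB aligned.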

-Andi

2008-03-17 07:00:30

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] [0/18] GB pages hugetlb support

Andi wrote:
> This was against 2.6.25-rc4

Ok - I'll try that one.

> Ideally look at the end result of the whole series.

Ok. Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-03-17 07:17:41

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

On Mon, Mar 17, 2008 at 12:02 AM, Andi Kleen <[email protected]> wrote:
> > node_boot_start is not page aligned?
>
> It is, but it is not necessarily GB aligned and without this
> change sometimes alloc_bootmem when requesting GB alignment
> doesn't return GB aligned memory. This was a nasty problem
> that took some time to track down.

or preferred has some problem?

preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;

YH

2008-03-17 07:27:17

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [0/18] GB pages hugetlb support

On Mon, Mar 17, 2008 at 02:00:18AM -0500, Paul Jackson wrote:
> Andi wrote:
> > This was against 2.6.25-rc4
>
> Ok - I'll try that one.

I just updated it to a 2.6.25-rc6 base, available at
ftp://firstfloor.org/pub/ak/gbpages/patches/
and gave it a quick test. So you can use that one too.

It only had a single easy reject.

-Andi

2008-03-17 07:31:41

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

On Mon, Mar 17, 2008 at 12:17 AM, Yinghai Lu <[email protected]> wrote:
>
> On Mon, Mar 17, 2008 at 12:02 AM, Andi Kleen <[email protected]> wrote:
> > > node_boot_start is not page aligned?
> >
> > It is, but it is not necessarily GB aligned and without this
> > change sometimes alloc_bootmem when requesting GB alignment
> > doesn't return GB aligned memory. This was a nasty problem
> > that took some time to track down.
>
> or preferred has some problem?
>
>
> preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
>

When node_boot_start is 512M aligned, and align is 1024M, offset
could be 512M. It seems
i = ALIGN(i, incr) needs to do something with offset...

YH

2008-03-17 07:39:30

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

> when node_boot_start is 512M alignment, and align is 1024M, offset
> could be 512M. it seems
> i = ALIGN(i, incr) need to do sth with offset...

It's possible that there are better fixes for this, but at least
my simple patch seems to work here. I admit I was banging my
head against this for some time and when I did the fix I just
wanted the bug to go away and didn't really go for subtlety.

The bootmem allocator is quite spaghetti in fact; it could
really use some general clean up (although it's not quite
as bad yet as page_alloc.c)

-Andi

2008-03-17 07:53:55

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

On Mon, Mar 17, 2008 at 12:41 AM, Andi Kleen <[email protected]> wrote:
> > when node_boot_start is 512M alignment, and align is 1024M, offset
> > could be 512M. it seems
> > i = ALIGN(i, incr) need to do sth with offset...
>
> It's possible that there are better fixes for this, but at least
> my simple patch seems to work here. I admit I was banging my
> head against this for some time and when I did the fix I just
> wanted the bug to go away and didn't really go for subtleness.
>
> The bootmem allocator is quite spaghetti in fact, it could
> really need some general clean up (although it's' not quite
> as bad yet as page_alloc.c)

i = ALIGN(i+offset, incr) - offset;

also the one in fail_block...

This only happens when align is larger than the alignment of node_boot_start.

YH

2008-03-17 08:09:56

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] [4/18] Add basic support for more than one hstate in hugetlbfs

Andi,

Seems to me that both patches 2/18 and 4/18 are called:

Add basic support for more than one hstate in hugetlbfs

You probably want to change this detail.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-03-17 08:13:17

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [4/18] Add basic support for more than one hstate in hugetlbfs

On Mon, Mar 17, 2008 at 03:09:42AM -0500, Paul Jackson wrote:
> Andi,
>
> Seems to me that both patches 2/18 and 4/18 are called:
>
> Add basic support for more than one hstate in hugetlbfs
>
> You probably want to change this detail.

Fixed, thanks. Indeed the description went wrong on 4/18;
2/ was the correct one.

-Andi

2008-03-17 08:15:23

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

On Mon, Mar 17, 2008 at 01:10:31AM -0700, Yinghai Lu wrote:
> please check the one against -mm and x86.git

No, offset is not enough because it is still relative to the zone
start. I'm preparing an updated patch.

-Andi

2008-03-17 08:56:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

> only happen when align is large than alignment of node_boot_start.

Here's an updated version of the patch with this addressed.
Please review. The patch is somewhat more complicated, but
actually makes the code a little cleaner now.

-Andi


Fix alignment bug in bootmem allocator

Without this fix bootmem can return unaligned addresses when the start of a
node is not aligned to the align value. Needed for reliably allocating
gigabyte pages.

I removed the offset variable because all tests should align themselves correctly
now. A slight drawback might be that the bootmem allocator will spend
some more time skipping bits in the bitmap initially, but that shouldn't
be a big issue.

Signed-off-by: Andi Kleen <[email protected]>

---
mm/bootmem.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

Index: linux/mm/bootmem.c
===================================================================
--- linux.orig/mm/bootmem.c
+++ linux/mm/bootmem.c
@@ -195,8 +195,9 @@ void * __init
__alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
unsigned long align, unsigned long goal, unsigned long limit)
{
- unsigned long offset, remaining_size, areasize, preferred;
- unsigned long i, start = 0, incr, eidx, end_pfn;
+ unsigned long remaining_size, areasize, preferred;
+ unsigned long i, start, incr, eidx, end_pfn;
+ unsigned long pfn;
void *ret;

if (!size) {
@@ -218,10 +219,6 @@ __alloc_bootmem_core(struct bootmem_data
end_pfn = limit;

eidx = end_pfn - PFN_DOWN(bdata->node_boot_start);
- offset = 0;
- if (align && (bdata->node_boot_start & (align - 1UL)) != 0)
- offset = align - (bdata->node_boot_start & (align - 1UL));
- offset = PFN_DOWN(offset);

/*
* We try to allocate bootmem pages above 'goal'
@@ -236,15 +233,18 @@ __alloc_bootmem_core(struct bootmem_data
} else
preferred = 0;

- preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
+ start = bdata->node_boot_start;
+ preferred = PFN_DOWN(ALIGN(preferred + start, align) - start);
areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
incr = align >> PAGE_SHIFT ? : 1;
+ pfn = PFN_DOWN(start);
+ start = 0;

restart_scan:
for (i = preferred; i < eidx; i += incr) {
unsigned long j;
i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
- i = ALIGN(i, incr);
+ i = ALIGN(pfn + i, incr) - pfn;
if (i >= eidx)
break;
if (test_bit(i, bdata->node_bootmem_map))
@@ -258,11 +258,11 @@ restart_scan:
start = i;
goto found;
fail_block:
- i = ALIGN(j, incr);
+ i = ALIGN(j + pfn, incr) - pfn;
}

- if (preferred > offset) {
- preferred = offset;
+ if (preferred > 0) {
+ preferred = 0;
goto restart_scan;
}
return NULL;
@@ -278,7 +278,7 @@ found:
*/
if (align < PAGE_SIZE &&
bdata->last_offset && bdata->last_pos+1 == start) {
- offset = ALIGN(bdata->last_offset, align);
+ unsigned long offset = ALIGN(bdata->last_offset, align);
BUG_ON(offset > PAGE_SIZE);
remaining_size = PAGE_SIZE - offset;
if (size < remaining_size) {

2008-03-17 09:27:16

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] [0/18] GB pages hugetlb support

Andi wrote:
> I hacked in also cpuset support. It would be good if
> Paul double checked that.

Well, from what I can see, Ken Chen wrote the code that deals with
constraints on hugetlb allocation. So I'll copy him on this reply,
along with the other two subject matter experts I know of in this area,
Christoph Lameter and Adam Litke.

The following is the only cpuset related change I saw in this
patchset. It looks pretty obvious to me ... just changing the code to
adapt to Andi's new 'struct hstate' for holding what had been global
hugetlb state.

@@ -1228,18 +1252,18 @@ static int hugetlb_acct_memory(long delt
* semantics that cpuset has.
*/
if (delta > 0) {
- if (gather_surplus_pages(delta) < 0)
+ if (gather_surplus_pages(h, delta) < 0)
goto out;

- if (delta > cpuset_mems_nr(free_huge_pages_node)) {
- return_unused_surplus_pages(delta);
+ if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
+ return_unused_surplus_pages(h, delta);
goto out;
}
}


Andi claimed, in one of his replies earlier on this thread, that there
were further interactions between cpusets and later patches in the set
("Add basic support for more than one hstate in hugetlbfs" and partly
"Add support to have individual hstates for each hugetlbfs mount"),
but I'm not understanding what that interaction is yet.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-03-17 09:29:49

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] [18/18] Implement hugepagesz= option for x86-64

Andi wrote:
+ hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
+ hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
+ On x86 this option can be specified multiple times
+ interleaved with hugepages= to reserve huge pages
+ of different sizes. Valid pages sizes on x86-64
+ are 2M (when the CPU supports "pse") and 1G (when the
+ CPU supports the "pdpe1gb" cpuinfo flag)
+ Note that 1GB pages can only be allocated at boot time
+ using hugepages= and not freed afterwards.

This seems to say that hugepages are required for hugepagesz to be
useful, but hugepagesz is supported on PPC, whereas hugepages is not
supported on PPC ...odd.

Should those two HW lists be the same (and sorted in the same order,
for ease of reading)?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-03-17 10:00:13

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [18/18] Implement hugepagesz= option for x86-64

On Mon, Mar 17, 2008 at 04:29:39AM -0500, Paul Jackson wrote:
> Andi wrote:
> + hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
> + hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
> + On x86 this option can be specified multiple times
> + interleaved with hugepages= to reserve huge pages
> + of different sizes. Valid pages sizes on x86-64
> + are 2M (when the CPU supports "pse") and 1G (when the
> + CPU supports the "pdpe1gb" cpuinfo flag)
> + Note that 1GB pages can only be allocated at boot time
> + using hugepages= and not freed afterwards.
>
> This seems to say that hugepages are required for hugepagesz to be

Yes, but that was already there before. I didn't change it.

I agree it should be fixed, but I would prefer not to mix
PPC-specific patches into my patchkit, so I hope someone
else will do that afterwards.

> useful, but hugepagesz is supported on PPC, whereas hugepages is not
> supported on PPC ...odd.
>
> Should those two HW lists be the same (and sorted in the same order,
> for ease of reading)?

Not all architectures support hugepagesz=, in particular i386
does not and possibly others. It is implemented by arch specific
code.

-Andi

2008-03-17 10:02:24

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] [18/18] Implement hugepagesz= option for x86-64

Andi wrote:
> Yes, but that was already there before. I didn't change it.
>
> I agree it should be fixed, but i would prefer to not mix
> PPC specific patches into my patchkit

Ok - good plan.

Do you know offhand what would be the correct HW list for hugepages and
hugepagesz?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214

2008-03-17 15:03:48

by Adam Litke

[permalink] [raw]
Subject: Re: [PATCH] [0/18] GB pages hugetlb support


On Mon, 2008-03-17 at 02:58 +0100, Andi Kleen wrote:
<snip>
> - lockdep sometimes complains about recursive page_table_locks
> for shared hugetlb memory, but as far as I can see I didn't
> actually change this area. Looks a little dubious, might
> be a false positive too.

I bet copy_hugetlb_page_range() is causing your complaints. It takes
the dest_mm->page_table_lock followed by src_mm->page_table_lock inside
a loop and hasn't yet been converted to call spin_lock_nested(). A
harmless false positive.
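
(For reference, the usual way to silence that, sketched here rather than
taken from the posted patches, is to annotate the second acquisition so
lockdep knows the nesting is intentional:

	spin_lock(&dst_mm->page_table_lock);
	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);

where dst_mm and src_mm are the destination and source mm_structs.)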

> - hugemmap04 from LTP fails. Cause unknown currently

I am not sure how well LTP is tracking mainline development in this
area. How do these patches do with the libhugetlbfs test suite? We are
adding support for ginormous pages (1GB, 16GB, etc) but it is not
complete. Should run fine with 2M pages though.

Before you ask, here is the link:
http://libhugetlbfs.ozlabs.org/snapshots/libhugetlbfs-dev-20080310.tar.gz

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2008-03-17 15:30:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [0/18] GB pages hugetlb support

> I bet copy_hugetlb_page_range() is causing your complaints. It takes
> the dest_mm->page_table_lock followed by src_mm->page_table_lock inside
> a loop and hasn't yet been converted to call spin_lock_nested(). A
> harmless false positive.

Yes. Looking at the warning I'm not sure why lockdep doesn't filter
it out automatically. I cannot think of a legitimate case where
a "possible recursive lock" with different lock addresses would be
a genuine bug.

So instead of a false positive, it's more like an "always false" :)

>
> > - hugemmap04 from LTP fails. Cause unknown currently
>
> I am not sure how well LTP is tracking mainline development in this
> area. How do these patches do with the libhugetlbfs test suite? We are

I wasn't aware of that one.

-Andi

2008-03-17 15:57:33

by Adam Litke

[permalink] [raw]
Subject: Re: [PATCH] [0/18] GB pages hugetlb support


On Mon, 2008-03-17 at 16:33 +0100, Andi Kleen wrote:
> > I bet copy_hugetlb_page_range() is causing your complaints. It takes
> > the dest_mm->page_table_lock followed by src_mm->page_table_lock inside
> > a loop and hasn't yet been converted to call spin_lock_nested(). A
> > harmless false positive.
>
> Yes. Looking at the warning I'm not sure why lockdep doesn't filter
> it out automatically. I cannot think of a legitimate case where
> a "possible recursive lock" with different lock addresses would be
> a genuine bug.
>
> So instead of a false positive, it's more like a "always false" :)
>
> >
> > > - hugemmap04 from LTP fails. Cause unknown currently
> >
> > I am not sure how well LTP is tracking mainline development in this
> > area. How do these patches do with the libhugetlbfs test suite? We are
>
> I wasn't aware of that one.

Libhugetlbfs comes with a rigorous functional test suite. It has test
cases for specific bugs that have since been fixed. I ran it on your
patches and got an oops around hugetlb_overcommit_handler() when running
the 'counters' test.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2008-03-17 18:52:54

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

On Mon, Mar 17, 2008 at 1:56 AM, Andi Kleen <[email protected]> wrote:
> > only happen when align is large than alignment of node_boot_start.
>
> Here's an updated version of the patch with this addressed.
> Please review. The patch is somewhat more complicated, but
> actually makes the code a little cleaner now.
>
> -Andi
>
>
> Fix alignment bug in bootmem allocator
>
>
> Without this fix bootmem can return unaligned addresses when the start of a
> node is not aligned to the align value. Needed for reliably allocating
> gigabyte pages.
>
> I removed the offset variable because all tests should align themself correctly
> now. Slight drawback might be that the bootmem allocator will spend
> some more time skipping bits in the bitmap initially, but that shouldn't
> be a big issue.
>
>
> Signed-off-by: Andi Kleen <[email protected]>
>
How about creating a local node_boot_start and node_bootmem_map that make
sure node_boot_start has a bigger alignment than the align input?

YH

2008-03-17 20:14:01

by Adam Litke

[permalink] [raw]
Subject: Re: [PATCH] [1/18] Convert hugeltlb.c over to pass global state around in a structure

I didn't see anything fundamentally wrong with this... In fact it is
looking really nice notwithstanding the minor nits below.

On Mon, 2008-03-17 at 02:58 +0100, Andi Kleen wrote:
> Large, but rather mechanical patch that converts most of the hugetlb.c
> globals into structure members and passes them around.
>
> Right now there is only a single global hstate structure, but
> most of the infrastructure to extend it is there.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
<snip>
> @@ -117,23 +113,24 @@ static struct page *dequeue_huge_page_vm
> return page;
> }
>
> -static void update_and_free_page(struct page *page)
> +static void update_and_free_page(struct hstate *h, struct page *page)
> {
> int i;
> - nr_huge_pages--;
> - nr_huge_pages_node[page_to_nid(page)]--;
> - for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
> + h->nr_huge_pages--;
> + h->nr_huge_pages_node[page_to_nid(page)]--;
> + for (i = 0; i < (1 << huge_page_order(h)); i++) {
> page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
> 1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
> 1 << PG_private | 1<< PG_writeback);
> }

Could you define a macro for (1 << huge_page_order(h))? It is used at
least 4 times. How about something like pages_per_huge_page(h) or
something? I think that would convey the meaning more clearly.
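
Something along these lines would do (just a sketch of the suggestion, the
exact form is of course up to you):

	static inline unsigned long pages_per_huge_page(struct hstate *h)
	{
		/* number of base pages backing one huge page of this hstate */
		return 1UL << huge_page_order(h);
	}

so the loop above becomes for (i = 0; i < pages_per_huge_page(h); i++).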

<snip>

> @@ -190,18 +187,18 @@ static int adjust_pool_surplus(int delta
> return ret;
> }
>
> -static struct page *alloc_fresh_huge_page_node(int nid)
> +static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> {
> struct page *page;
>
> page = alloc_pages_node(nid,
> htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> - HUGETLB_PAGE_ORDER);
> + huge_page_order(h));

Whitespace?

<snip>

> @@ -272,17 +270,17 @@ static struct page *alloc_buddy_huge_pag
> * per-node value is checked there.
> */
> spin_lock(&hugetlb_lock);
> - if (surplus_huge_pages >= nr_overcommit_huge_pages) {
> + if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> spin_unlock(&hugetlb_lock);
> return NULL;
> } else {
> - nr_huge_pages++;
> - surplus_huge_pages++;
> + h->nr_huge_pages++;
> + h->surplus_huge_pages++;
> }
> spin_unlock(&hugetlb_lock);
>
> page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> - HUGETLB_PAGE_ORDER);
> + huge_page_order(h));

Whitespace?

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2008-03-17 20:21:19

by Adam Litke

[permalink] [raw]
Subject: Re: [PATCH] [2/18] Add basic support for more than one hstate in hugetlbfs


On Mon, 2008-03-17 at 02:58 +0100, Andi Kleen wrote:
> - Convert hstates to an array
> - Add a first default entry covering the standard huge page size
> - Add functions for architectures to register new hstates
> - Add basic iterators over hstates
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
<snip>
> @@ -497,11 +501,34 @@ static int __init hugetlb_init(void)
> break;
> }
> max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> - printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
> +
> + printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
> + h->free_huge_pages,
> + 1 << (h->order + PAGE_SHIFT - 20));
> return 0;
> }

I'd like to avoid assuming the huge page size is some multiple of MB.
PowerPC will have a 64KB huge page. Granted, you do fix this in a later
patch, so as long as the whole series goes together this shouldn't cause
a problem.

> +
> +static int __init hugetlb_init(void)
> +{
> + if (HPAGE_SHIFT == 0)
> + return 0;
> + return hugetlb_init_hstate(&global_hstate);
> +}
> module_init(hugetlb_init);
>
> +/* Should be called on processing a hugepagesz=... option */
> +void __init huge_add_hstate(unsigned order)
> +{
> + struct hstate *h;
> + BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
> + BUG_ON(order <= HPAGE_SHIFT - PAGE_SHIFT);
> + h = &hstates[max_hstate++];
> + h->order = order;
> + h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
> + hugetlb_init_hstate(h);
> + parsed_hstate = h;
> +}

Since mask can always be derived from order, is there a reason we don't
always calculate it? I guess it boils down to storage cost vs.
calculation cost and I don't feel too strongly either way.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2008-03-17 20:27:19

by Adam Litke

[permalink] [raw]
Subject: Re: [PATCH] [4/18] Add basic support for more than one hstate in hugetlbfs

With this patch you will call try_to_free_low on all registered page
sizes. As written, when a user reduces the number of one page size, all
page sizes could be affected. I don't think that's what you want to do.
Perhaps just call do_try_to_free_low() on the hstate in question.
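
Roughly something like this (a sketch only, assuming the sysctl path can pass
down the hstate whose count is being changed):

	static void try_to_free_low(struct hstate *h, unsigned long count)
	{
		/* only shrink the pool that is actually being resized */
		do_try_to_free_low(h, count);
	}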

On Mon, 2008-03-17 at 02:58 +0100, Andi Kleen wrote:
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/hugetlb.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -550,26 +550,33 @@ static unsigned int cpuset_mems_nr(unsig
>
> #ifdef CONFIG_SYSCTL
> #ifdef CONFIG_HIGHMEM
> -static void try_to_free_low(unsigned long count)
> +static void do_try_to_free_low(struct hstate *h, unsigned long count)
> {
> - struct hstate *h = &global_hstate;
> int i;
>
> for (i = 0; i < MAX_NUMNODES; ++i) {
> struct page *page, *next;
> struct list_head *freel = &h->hugepage_freelists[i];
> list_for_each_entry_safe(page, next, freel, lru) {
> - if (count >= nr_huge_pages)
> + if (count >= h->nr_huge_pages)
> return;
> if (PageHighMem(page))
> continue;
> list_del(&page->lru);
> - update_and_free_page(page);
> + update_and_free_page(h, page);
> h->free_huge_pages--;
> h->free_huge_pages_node[page_to_nid(page)]--;
> }
> }
> }
> +
> +static void try_to_free_low(unsigned long count)
> +{
> + struct hstate *h;
> + for_each_hstate (h) {
> + do_try_to_free_low(h, count);
> + }
> +}
> #else
> static inline void try_to_free_low(unsigned long count)
> {
>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2008-03-17 20:30:31

by Adam Litke

[permalink] [raw]
Subject: Re: [PATCH] [10/18] Factor out new huge page preparation code into separate function


On Mon, 2008-03-17 at 02:58 +0100, Andi Kleen wrote:
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -200,6 +200,17 @@ static int adjust_pool_surplus(struct hs
> return ret;
> }
>
> +static void huge_new_page(struct hstate *h, struct page *page)
> +{
> + unsigned nid = pfn_to_nid(page_to_pfn(page));
> + set_compound_page_dtor(page, free_huge_page);
> + spin_lock(&hugetlb_lock);
> + h->nr_huge_pages++;
> + h->nr_huge_pages_node[nid]++;
> + spin_unlock(&hugetlb_lock);
> + put_page(page); /* free it into the hugepage allocator */
> +}
> +
> static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> {
> struct page *page;

We do not usually preface functions in mm/hugetlb.c with "huge" and the
name you have chosen doesn't seem that clear to me anyway. Could we
rename it to prep_new_huge_page() or something similar?

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2008-03-17 20:42:15

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [2/18] Add basic support for more than one hstate in hugetlbfs

> PowerPC will have a 64KB huge page. Granted, you do fix this in a later
> patch, so as long as the whole series goes together this shouldn't cause
> a problem.

No, the later patch only supports GB and MB. If you want KB
you have to do it yourself.

But my patch just keeps the KB support as it was before.
>
> Since mask can always be derived from order, is there a reason we don't

If there was a reason, I forgot it. Doesn't really matter much either
way.

-Andi

2008-03-17 21:27:32

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

On Mon, Mar 17, 2008 at 11:52 AM, Yinghai Lu <[email protected]> wrote:
> On Mon, Mar 17, 2008 at 1:56 AM, Andi Kleen <[email protected]> wrote:
> > > only happen when align is larger than the alignment of node_boot_start.
> >
> > Here's an updated version of the patch with this addressed.
> > Please review. The patch is somewhat more complicated, but
> > actually makes the code a little cleaner now.
> >
> > -Andi
> >
> >
> > Fix alignment bug in bootmem allocator
> >
> >
> > Without this fix bootmem can return unaligned addresses when the start of a
> > node is not aligned to the align value. Needed for reliably allocating
> > gigabyte pages.
> >
> > I removed the offset variable because all tests should align themselves correctly
> > now. Slight drawback might be that the bootmem allocator will spend
> > some more time skipping bits in the bitmap initially, but that shouldn't
> > be a big issue.
> >
> >
> > Signed-off-by: Andi Kleen <[email protected]>
> >
> how about create local node_boot_start and node_bootmem_map that make
> sure node_boot_start has bigger alignment than align input.

please check it

YH


Attachments:
offset_alloc_bootmem_v2.patch (4.72 kB)

2008-03-18 02:07:00

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

On Mon, Mar 17, 2008 at 2:27 PM, Yinghai Lu <[email protected]> wrote:
>
> On Mon, Mar 17, 2008 at 11:52 AM, Yinghai Lu <[email protected]> wrote:
> > On Mon, Mar 17, 2008 at 1:56 AM, Andi Kleen <[email protected]> wrote:
> > > > only happen when align is larger than the alignment of node_boot_start.
> > >
> > > Here's an updated version of the patch with this addressed.
> > > Please review. The patch is somewhat more complicated, but
> > > actually makes the code a little cleaner now.
> > >
> > > -Andi
> > >
> > >
> > > Fix alignment bug in bootmem allocator
> > >
> > >
> > > Without this fix bootmem can return unaligned addresses when the start of a
> > > node is not aligned to the align value. Needed for reliably allocating
> > > gigabyte pages.
> > >
> > > I removed the offset variable because all tests should align themselves correctly
> > > now. Slight drawback might be that the bootmem allocator will spend
> > > some more time skipping bits in the bitmap initially, but that shouldn't
> > > be a big issue.
> > >
> > >
> > > Signed-off-by: Andi Kleen <[email protected]>
> > >
> > how about create local node_boot_start and node_bootmem_map that make
> > sure node_boot_start has bigger alignment than align input.
>
> please check it
>

please don't use v2... it doesn't work.

YH

2008-03-18 12:05:28

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [1/18] Convert hugeltlb.c over to pass global state around in a structure

On (17/03/08 02:58), Andi Kleen didst pronounce:
> Large, but rather mechanical patch that converts most of the hugetlb.c
> globals into structure members and passes them around.
>
> Right now there is only a single global hstate structure, but
> most of the infrastructure to extend it is there.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> arch/ia64/mm/hugetlbpage.c | 2
> arch/powerpc/mm/hugetlbpage.c | 2
> arch/sh/mm/hugetlbpage.c | 2
> arch/sparc64/mm/hugetlbpage.c | 2
> arch/x86/mm/hugetlbpage.c | 2
> fs/hugetlbfs/inode.c | 45 +++---
> include/linux/hugetlb.h | 70 +++++++++
> ipc/shm.c | 3
> mm/hugetlb.c | 295 ++++++++++++++++++++++--------------------
> mm/memory.c | 2
> mm/mempolicy.c | 10 -
> mm/mmap.c | 3
> 12 files changed, 269 insertions(+), 169 deletions(-)
>
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -22,30 +22,24 @@
> #include "internal.h"
>
> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> -static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
> -static unsigned long surplus_huge_pages;
> -static unsigned long nr_overcommit_huge_pages;
> unsigned long max_huge_pages;
> unsigned long sysctl_overcommit_huge_pages;
> -static struct list_head hugepage_freelists[MAX_NUMNODES];
> -static unsigned int nr_huge_pages_node[MAX_NUMNODES];
> -static unsigned int free_huge_pages_node[MAX_NUMNODES];
> -static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
> unsigned long hugepages_treat_as_movable;
> -static int hugetlb_next_nid;
> +
> +struct hstate global_hstate;
>

hstate isn't a particularly informative name as it's the state of what?
At a glance, someone may think it's a per-mount state, whereas I am
expecting that multiple mounts using the same pagesize will share the
same pool.

Call it something like struct hugepage_pool?

> /*
> * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> */
> static DEFINE_SPINLOCK(hugetlb_lock);
>

This is not the patch to do it but it'll be worth looking at moving
hugetlb_lock into hstate later for workloads using different pagesizes
at the same time.

> -static void clear_huge_page(struct page *page, unsigned long addr)
> +static void clear_huge_page(struct page *page, unsigned long addr, unsigned sz)

hpage_size instead of sz to match the old define HPAGE_SIZE but to reflect
it is potentially no longer a constant?

That said, when calling clear_huge_page(), the caller has the VMA and could
pass a struct hstate * instead of sz here, more on what that may be useful
below.

> {
> int i;
>
> might_sleep();
> - for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); i++) {
> + for (i = 0; i < sz/PAGE_SIZE; i++) {

If you passed the hstate, and had a helper like

static inline int basepages_per_hpage(struct hstate *h)
{
        return 1 << huge_page_order(h);
}

you could have i < basepages_per_hpage(h) here and use it in a number of
places throughout the patch. (suggestions on a better name are welcome)

sz/PAGE_SIZE is not very self-explanatory (hpage_size is a little easier)
and understanding what 1 << huge_page_order(h) means takes a little thought.

> cond_resched();
> clear_user_highpage(page + i, addr + i * PAGE_SIZE);
> }
> @@ -55,34 +49,35 @@ static void copy_huge_page(struct page *
> unsigned long addr, struct vm_area_struct *vma)
> {
> int i;
> + struct hstate *h = hstate_vma(vma);
>
> might_sleep();
> - for (i = 0; i < HPAGE_SIZE/PAGE_SIZE; i++) {
> + for (i = 0; i < 1 << huge_page_order(h); i++) {

basepages_per_hpage(h)

> cond_resched();
> copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
> }
> }
>
> -static void enqueue_huge_page(struct page *page)
> +static void enqueue_huge_page(struct hstate *h, struct page *page)
> {
> int nid = page_to_nid(page);
> - list_add(&page->lru, &hugepage_freelists[nid]);
> - free_huge_pages++;
> - free_huge_pages_node[nid]++;
> + list_add(&page->lru, &h->hugepage_freelists[nid]);
> + h->free_huge_pages++;
> + h->free_huge_pages_node[nid]++;

Equivalent code, looks fine.

> }
>
> -static struct page *dequeue_huge_page(void)
> +static struct page *dequeue_huge_page(struct hstate *h)
> {
> int nid;
> struct page *page = NULL;
>
> for (nid = 0; nid < MAX_NUMNODES; ++nid) {
> - if (!list_empty(&hugepage_freelists[nid])) {
> - page = list_entry(hugepage_freelists[nid].next,
> + if (!list_empty(&h->hugepage_freelists[nid])) {
> + page = list_entry(h->hugepage_freelists[nid].next,
> struct page, lru);
> list_del(&page->lru);
> - free_huge_pages--;
> - free_huge_pages_node[nid]--;
> + h->free_huge_pages--;
> + h->free_huge_pages_node[nid]--;
> break;

Equivalent code, looks fine.

> }
> }
> @@ -98,18 +93,19 @@ static struct page *dequeue_huge_page_vm
> struct zonelist *zonelist = huge_zonelist(vma, address,
> htlb_alloc_mask, &mpol);
> struct zone **z;
> + struct hstate *h = hstate_vma(vma);
>
> for (z = zonelist->zones; *z; z++) {
> nid = zone_to_nid(*z);
> if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
> - !list_empty(&hugepage_freelists[nid])) {
> - page = list_entry(hugepage_freelists[nid].next,
> + !list_empty(&h->hugepage_freelists[nid])) {
> + page = list_entry(h->hugepage_freelists[nid].next,
> struct page, lru);
> list_del(&page->lru);
> - free_huge_pages--;
> - free_huge_pages_node[nid]--;
> + h->free_huge_pages--;
> + h->free_huge_pages_node[nid]--;
> if (vma && vma->vm_flags & VM_MAYSHARE)
> - resv_huge_pages--;
> + h->resv_huge_pages--;
> break;

Equivalent code, looks fine.

> }
> }
> @@ -117,23 +113,24 @@ static struct page *dequeue_huge_page_vm
> return page;
> }
>
> -static void update_and_free_page(struct page *page)
> +static void update_and_free_page(struct hstate *h, struct page *page)
> {
> int i;
> - nr_huge_pages--;
> - nr_huge_pages_node[page_to_nid(page)]--;
> - for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
> + h->nr_huge_pages--;
> + h->nr_huge_pages_node[page_to_nid(page)]--;
> + for (i = 0; i < (1 << huge_page_order(h)); i++) {

basepages_per_hpage(h)

> page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
> 1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
> 1 << PG_private | 1<< PG_writeback);
> }
> set_compound_page_dtor(page, NULL);
> set_page_refcounted(page);
> - __free_pages(page, HUGETLB_PAGE_ORDER);
> + __free_pages(page, huge_page_order(h));

Otherwise, seems ok.

> }
>
> static void free_huge_page(struct page *page)
> {
> + struct hstate *h = &global_hstate;

hmm, when there are multiple struct hstates later, you are going to need to
distinguish between them, otherwise pages of the wrong size will end up in the
wrong pool. As you are getting a compound page, I guess you would distinguish
based on size. In isolation this patch is fine, but it needs to be watched:
if it is overlooked, it'll cause oopses or memory corruption when too-small
pages end up in the wrong pool and clear_huge_page() is called.
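
One possible way to recover the right hstate in free_huge_page() later would
be a size lookup over the registered hstates (sketch only, assuming the
for_each_hstate() iterator added by a later patch in the series):

        /* sketch: map a compound page back to the pool it came from */
        static struct hstate *size_to_hstate(unsigned long size)
        {
                struct hstate *h;

                for_each_hstate(h) {
                        if (huge_page_size(h) == size)
                                return h;
                }
                return NULL;
        }

        /* in free_huge_page(): */
        struct hstate *h = size_to_hstate(PAGE_SIZE << compound_order(page));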

> int nid = page_to_nid(page);
> struct address_space *mapping;
>
> @@ -143,12 +140,12 @@ static void free_huge_page(struct page *
> INIT_LIST_HEAD(&page->lru);
>
> spin_lock(&hugetlb_lock);
> - if (surplus_huge_pages_node[nid]) {
> - update_and_free_page(page);
> - surplus_huge_pages--;
> - surplus_huge_pages_node[nid]--;
> + if (h->surplus_huge_pages_node[nid]) {
> + update_and_free_page(h, page);
> + h->surplus_huge_pages--;
> + h->surplus_huge_pages_node[nid]--;
> } else {
> - enqueue_huge_page(page);
> + enqueue_huge_page(h, page);
> }
> spin_unlock(&hugetlb_lock);
> if (mapping)
> @@ -160,7 +157,7 @@ static void free_huge_page(struct page *
> * balanced by operating on them in a round-robin fashion.
> * Returns 1 if an adjustment was made.
> */
> -static int adjust_pool_surplus(int delta)
> +static int adjust_pool_surplus(struct hstate *h, int delta)
> {
> static int prev_nid;
> int nid = prev_nid;
> @@ -173,15 +170,15 @@ static int adjust_pool_surplus(int delta
> nid = first_node(node_online_map);
>
> /* To shrink on this node, there must be a surplus page */
> - if (delta < 0 && !surplus_huge_pages_node[nid])
> + if (delta < 0 && !h->surplus_huge_pages_node[nid])
> continue;
> /* Surplus cannot exceed the total number of pages */
> - if (delta > 0 && surplus_huge_pages_node[nid] >=
> - nr_huge_pages_node[nid])
> + if (delta > 0 && h->surplus_huge_pages_node[nid] >=
> + h->nr_huge_pages_node[nid])
> continue;
>
> - surplus_huge_pages += delta;
> - surplus_huge_pages_node[nid] += delta;
> + h->surplus_huge_pages += delta;
> + h->surplus_huge_pages_node[nid] += delta;
> ret = 1;

Looks equivalent.

> break;
> } while (nid != prev_nid);
> @@ -190,18 +187,18 @@ static int adjust_pool_surplus(int delta
> return ret;
> }
>
> -static struct page *alloc_fresh_huge_page_node(int nid)
> +static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> {
> struct page *page;
>
> page = alloc_pages_node(nid,
> htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> - HUGETLB_PAGE_ORDER);
> + huge_page_order(h));

nit, change in indenting here for no apparent reason.

> if (page) {
> set_compound_page_dtor(page, free_huge_page);
> spin_lock(&hugetlb_lock);
> - nr_huge_pages++;
> - nr_huge_pages_node[nid]++;
> + h->nr_huge_pages++;
> + h->nr_huge_pages_node[nid]++;
> spin_unlock(&hugetlb_lock);
> put_page(page); /* free it into the hugepage allocator */
> }
> @@ -209,17 +206,17 @@ static struct page *alloc_fresh_huge_pag
> return page;
> }
>
> -static int alloc_fresh_huge_page(void)
> +static int alloc_fresh_huge_page(struct hstate *h)
> {
> struct page *page;
> int start_nid;
> int next_nid;
> int ret = 0;
>
> - start_nid = hugetlb_next_nid;
> + start_nid = h->hugetlb_next_nid;
>
> do {
> - page = alloc_fresh_huge_page_node(hugetlb_next_nid);
> + page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
> if (page)
> ret = 1;
> /*
> @@ -233,17 +230,18 @@ static int alloc_fresh_huge_page(void)
> * if we just successfully allocated a hugepage so that
> * the next caller gets hugepages on the next node.
> */
> - next_nid = next_node(hugetlb_next_nid, node_online_map);
> + next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> if (next_nid == MAX_NUMNODES)
> next_nid = first_node(node_online_map);
> - hugetlb_next_nid = next_nid;
> - } while (!page && hugetlb_next_nid != start_nid);
> + h->hugetlb_next_nid = next_nid;
> + } while (!page && h->hugetlb_next_nid != start_nid);
>

Equivalent code, seems fine.

> return ret;
> }
>
> -static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
> - unsigned long address)
> +static struct page *alloc_buddy_huge_page(struct hstate *h,
> + struct vm_area_struct *vma,
> + unsigned long address)
> {
> struct page *page;
> unsigned int nid;
> @@ -272,17 +270,17 @@ static struct page *alloc_buddy_huge_pag
> * per-node value is checked there.
> */
> spin_lock(&hugetlb_lock);
> - if (surplus_huge_pages >= nr_overcommit_huge_pages) {
> + if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
> spin_unlock(&hugetlb_lock);
> return NULL;
> } else {
> - nr_huge_pages++;
> - surplus_huge_pages++;
> + h->nr_huge_pages++;
> + h->surplus_huge_pages++;
> }
> spin_unlock(&hugetlb_lock);
>
> page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
> - HUGETLB_PAGE_ORDER);
> + huge_page_order(h));
>

Unnecessary change in whitespace there.

> spin_lock(&hugetlb_lock);
> if (page) {
> @@ -291,11 +289,11 @@ static struct page *alloc_buddy_huge_pag
> /*
> * We incremented the global counters already
> */
> - nr_huge_pages_node[nid]++;
> - surplus_huge_pages_node[nid]++;
> + h->nr_huge_pages_node[nid]++;
> + h->surplus_huge_pages_node[nid]++;
> } else {
> - nr_huge_pages--;
> - surplus_huge_pages--;
> + h->nr_huge_pages--;
> + h->surplus_huge_pages--;
> }
> spin_unlock(&hugetlb_lock);
>

Seems ok.

> @@ -306,16 +304,16 @@ static struct page *alloc_buddy_huge_pag
> * Increase the hugetlb pool such that it can accomodate a reservation
> * of size 'delta'.
> */
> -static int gather_surplus_pages(int delta)
> +static int gather_surplus_pages(struct hstate *h, int delta)
> {
> struct list_head surplus_list;
> struct page *page, *tmp;
> int ret, i;
> int needed, allocated;
>
> - needed = (resv_huge_pages + delta) - free_huge_pages;
> + needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
> if (needed <= 0) {
> - resv_huge_pages += delta;
> + h->resv_huge_pages += delta;
> return 0;
> }
>
> @@ -326,7 +324,7 @@ static int gather_surplus_pages(int delt
> retry:
> spin_unlock(&hugetlb_lock);
> for (i = 0; i < needed; i++) {
> - page = alloc_buddy_huge_page(NULL, 0);
> + page = alloc_buddy_huge_page(h, NULL, 0);
> if (!page) {
> /*
> * We were not able to allocate enough pages to
> @@ -347,7 +345,8 @@ retry:
> * because either resv_huge_pages or free_huge_pages may have changed.
> */
> spin_lock(&hugetlb_lock);
> - needed = (resv_huge_pages + delta) - (free_huge_pages + allocated);
> + needed = (h->resv_huge_pages + delta) -
> + (h->free_huge_pages + allocated);
> if (needed > 0)
> goto retry;
>
> @@ -360,13 +359,13 @@ retry:
> * before they are reserved.
> */
> needed += allocated;
> - resv_huge_pages += delta;
> + h->resv_huge_pages += delta;
> ret = 0;
> free:
> list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> list_del(&page->lru);
> if ((--needed) >= 0)
> - enqueue_huge_page(page);
> + enqueue_huge_page(h, page);
> else {
> /*
> * Decrement the refcount and free the page using its
> @@ -388,34 +387,35 @@ free:
> * allocated to satisfy the reservation must be explicitly freed if they were
> * never used.
> */
> -static void return_unused_surplus_pages(unsigned long unused_resv_pages)
> +static void
> +return_unused_surplus_pages(struct hstate *h, unsigned long unused_resv_pages)
> {
> static int nid = -1;
> struct page *page;
> unsigned long nr_pages;
>
> /* Uncommit the reservation */
> - resv_huge_pages -= unused_resv_pages;
> + h->resv_huge_pages -= unused_resv_pages;
>
> - nr_pages = min(unused_resv_pages, surplus_huge_pages);
> + nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
>
> while (nr_pages) {
> nid = next_node(nid, node_online_map);
> if (nid == MAX_NUMNODES)
> nid = first_node(node_online_map);
>
> - if (!surplus_huge_pages_node[nid])
> + if (!h->surplus_huge_pages_node[nid])
> continue;
>
> - if (!list_empty(&hugepage_freelists[nid])) {
> - page = list_entry(hugepage_freelists[nid].next,
> + if (!list_empty(&h->hugepage_freelists[nid])) {
> + page = list_entry(h->hugepage_freelists[nid].next,
> struct page, lru);
> list_del(&page->lru);
> - update_and_free_page(page);
> - free_huge_pages--;
> - free_huge_pages_node[nid]--;
> - surplus_huge_pages--;
> - surplus_huge_pages_node[nid]--;
> + update_and_free_page(h, page);
> + h->free_huge_pages--;
> + h->free_huge_pages_node[nid]--;
> + h->surplus_huge_pages--;
> + h->surplus_huge_pages_node[nid]--;
> nr_pages--;

Seems ok.

> }
> }
> @@ -437,16 +437,17 @@ static struct page *alloc_huge_page_priv
> unsigned long addr)
> {
> struct page *page = NULL;
> + struct hstate *h = hstate_vma(vma);
>
> if (hugetlb_get_quota(vma->vm_file->f_mapping, 1))
> return ERR_PTR(-VM_FAULT_SIGBUS);
>
> spin_lock(&hugetlb_lock);
> - if (free_huge_pages > resv_huge_pages)
> + if (h->free_huge_pages > h->resv_huge_pages)
> page = dequeue_huge_page_vma(vma, addr);
> spin_unlock(&hugetlb_lock);
> if (!page) {
> - page = alloc_buddy_huge_page(vma, addr);
> + page = alloc_buddy_huge_page(h, vma, addr);
> if (!page) {
> hugetlb_put_quota(vma->vm_file->f_mapping, 1);
> return ERR_PTR(-VM_FAULT_OOM);
> @@ -476,21 +477,27 @@ static struct page *alloc_huge_page(stru
> static int __init hugetlb_init(void)
> {
> unsigned long i;
> + struct hstate *h = &global_hstate;
>

Similar comment to free_huge_page(): if more than two hugepage sizes exist,
the boot parameters will need to distinguish which pool is being referred to.
global_hstate would be replaced by the default_hugepage_pool here, I would
guess.

> if (HPAGE_SHIFT == 0)
> return 0;
>
> + if (!h->order) {
> + h->order = HPAGE_SHIFT - PAGE_SHIFT;
> + h->mask = HPAGE_MASK;
> + }

Unwritten assumption here that HPAGE_SIZE != PAGE_SIZE. Probably a safe
assumption though.

WARN_ON_ONCE(HPAGE_SHIFT == PAGE_SHIFT) just in case?

> +
> for (i = 0; i < MAX_NUMNODES; ++i)
> - INIT_LIST_HEAD(&hugepage_freelists[i]);
> + INIT_LIST_HEAD(&h->hugepage_freelists[i]);
>
> - hugetlb_next_nid = first_node(node_online_map);
> + h->hugetlb_next_nid = first_node(node_online_map);
>
> for (i = 0; i < max_huge_pages; ++i) {
> - if (!alloc_fresh_huge_page())
> + if (!alloc_fresh_huge_page(h))
> break;
> }
> - max_huge_pages = free_huge_pages = nr_huge_pages = i;
> - printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
> + max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> + printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);

hmm, unrelated to this patch but that printk() is misleading. The language
implies it is size in bytes but the value is in pages. As you are changing the
code anyway, do you care to print out the size of the pages being allocated
and change the language to show it's pages being printed, not bytes?

> return 0;
> }
> module_init(hugetlb_init);
> @@ -518,19 +525,21 @@ static unsigned int cpuset_mems_nr(unsig
> #ifdef CONFIG_HIGHMEM
> static void try_to_free_low(unsigned long count)
> {
> + struct hstate *h = &global_hstate;

Similar comments to free_huge_page(), will need to select a hstate
differently later.

> int i;
>
> for (i = 0; i < MAX_NUMNODES; ++i) {
> struct page *page, *next;
> - list_for_each_entry_safe(page, next, &hugepage_freelists[i], lru) {
> + struct list_head *freel = &h->hugepage_freelists[i];
> + list_for_each_entry_safe(page, next, freel, lru) {
> if (count >= nr_huge_pages)
> return;
> if (PageHighMem(page))
> continue;
> list_del(&page->lru);
> update_and_free_page(page);
> - free_huge_pages--;
> - free_huge_pages_node[page_to_nid(page)]--;
> + h->free_huge_pages--;
> + h->free_huge_pages_node[page_to_nid(page)]--;
> }
> }
> }
> @@ -540,10 +549,11 @@ static inline void try_to_free_low(unsig
> }
> #endif
>
> -#define persistent_huge_pages (nr_huge_pages - surplus_huge_pages)
> +#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)

Should this be moved to the other hstate-related helpers?

> static unsigned long set_max_huge_pages(unsigned long count)
> {
> unsigned long min_count, ret;
> + struct hstate *h = &global_hstate;
>

Same story about multiple hstates. I think for this patch if you had a helper
similar to hstate_vma()/hstate_file()/hstate_inode() called hstate_pagesize()
that returned &global_hstate, it would help. The assumption would be with
multiple pagesizes later that you would have either a VMA or a pagesize to
work with.
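
For this patch such a helper would be trivial (sketch; the hstate_pagesize()
name is just the suggestion above):

        /* sketch: mirror hstate_vma()/hstate_file()/hstate_inode() */
        static inline struct hstate *hstate_pagesize(unsigned long pagesize)
        {
                return &global_hstate;
        }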

In general, the proc interface to this is going to need changing later to
handle different sizes. I suspect it is not of urgency to you as you are
likely filling 1GB pages at boot-time only in this patchset to avoid the
problem of allocating pages of orders >= MAX_ORDER.

> /*
> * Increase the pool size
> @@ -557,12 +567,12 @@ static unsigned long set_max_huge_pages(
> * within all the constraints specified by the sysctls.
> */
> spin_lock(&hugetlb_lock);
> - while (surplus_huge_pages && count > persistent_huge_pages) {
> - if (!adjust_pool_surplus(-1))
> + while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
> + if (!adjust_pool_surplus(h, -1))
> break;
> }
>
> - while (count > persistent_huge_pages) {
> + while (count > persistent_huge_pages(h)) {
> int ret;
> /*
> * If this allocation races such that we no longer need the
> @@ -570,7 +580,7 @@ static unsigned long set_max_huge_pages(
> * and reducing the surplus.
> */
> spin_unlock(&hugetlb_lock);
> - ret = alloc_fresh_huge_page();
> + ret = alloc_fresh_huge_page(h);
> spin_lock(&hugetlb_lock);
> if (!ret)
> goto out;
> @@ -592,21 +602,21 @@ static unsigned long set_max_huge_pages(
> * and won't grow the pool anywhere else. Not until one of the
> * sysctls are changed, or the surplus pages go out of use.
> */
> - min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
> + min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
> min_count = max(count, min_count);
> try_to_free_low(min_count);
> - while (min_count < persistent_huge_pages) {
> - struct page *page = dequeue_huge_page();
> + while (min_count < persistent_huge_pages(h)) {
> + struct page *page = dequeue_huge_page(h);
> if (!page)
> break;
> - update_and_free_page(page);
> + update_and_free_page(h, page);
> }
> - while (count < persistent_huge_pages) {
> - if (!adjust_pool_surplus(1))
> + while (count < persistent_huge_pages(h)) {
> + if (!adjust_pool_surplus(h, 1))
> break;
> }
> out:
> - ret = persistent_huge_pages;
> + ret = persistent_huge_pages(h);
> spin_unlock(&hugetlb_lock);
> return ret;
> }
> @@ -636,9 +646,10 @@ int hugetlb_overcommit_handler(struct ct
> struct file *file, void __user *buffer,
> size_t *length, loff_t *ppos)
> {
> + struct hstate *h = &global_hstate;
> proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> spin_lock(&hugetlb_lock);
> - nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
> + h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
> spin_unlock(&hugetlb_lock);
> return 0;
> }
> @@ -647,32 +658,35 @@ int hugetlb_overcommit_handler(struct ct
>
> int hugetlb_report_meminfo(char *buf)
> {
> + struct hstate *h = &global_hstate;
> return sprintf(buf,
> "HugePages_Total: %5lu\n"
> "HugePages_Free: %5lu\n"
> "HugePages_Rsvd: %5lu\n"
> "HugePages_Surp: %5lu\n"
> "Hugepagesize: %5lu kB\n",
> - nr_huge_pages,
> - free_huge_pages,
> - resv_huge_pages,
> - surplus_huge_pages,
> - HPAGE_SIZE/1024);
> + h->nr_huge_pages,
> + h->free_huge_pages,
> + h->resv_huge_pages,
> + h->surplus_huge_pages,
> + 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));

I've taken a note to see how you report meminfo if more than one hstate
ever exists.

> }
>
> int hugetlb_report_node_meminfo(int nid, char *buf)
> {
> + struct hstate *h = &global_hstate;
> return sprintf(buf,
> "Node %d HugePages_Total: %5u\n"
> "Node %d HugePages_Free: %5u\n",
> - nid, nr_huge_pages_node[nid],
> - nid, free_huge_pages_node[nid]);
> + nid, h->nr_huge_pages_node[nid],
> + nid, h->free_huge_pages_node[nid]);
> }
>
> /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
> unsigned long hugetlb_total_pages(void)
> {
> - return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE);
> + struct hstate *h = &global_hstate;
> + return h->nr_huge_pages * (1 << huge_page_order(h));
> }
>
> /*
> @@ -727,14 +741,16 @@ int copy_hugetlb_page_range(struct mm_st
> struct page *ptepage;
> unsigned long addr;
> int cow;
> + struct hstate *h = hstate_vma(vma);
> + unsigned sz = huge_page_size(h);

I would prefer hpage_size instead of sz to match up with HPAGE_SIZE.
Same applies for all future uses of sz.

>
> cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>
> - for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
> + for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
> src_pte = huge_pte_offset(src, addr);
> if (!src_pte)
> continue;
> - dst_pte = huge_pte_alloc(dst, addr);
> + dst_pte = huge_pte_alloc(dst, addr, sz);
> if (!dst_pte)
> goto nomem;
>
> @@ -770,6 +786,9 @@ void __unmap_hugepage_range(struct vm_ar
> pte_t pte;
> struct page *page;
> struct page *tmp;
> + struct hstate *h = hstate_vma(vma);
> + unsigned sz = huge_page_size(h);
> +
> /*
> * A page gathering list, protected by per file i_mmap_lock. The
> * lock is used to avoid list corruption from multiple unmapping
> @@ -778,11 +797,11 @@ void __unmap_hugepage_range(struct vm_ar
> LIST_HEAD(page_list);
>
> WARN_ON(!is_vm_hugetlb_page(vma));
> - BUG_ON(start & ~HPAGE_MASK);
> - BUG_ON(end & ~HPAGE_MASK);
> + BUG_ON(start & ~huge_page_mask(h));
> + BUG_ON(end & ~huge_page_mask(h));
>
> spin_lock(&mm->page_table_lock);
> - for (address = start; address < end; address += HPAGE_SIZE) {
> + for (address = start; address < end; address += sz) {
> ptep = huge_pte_offset(mm, address);
> if (!ptep)
> continue;
> @@ -830,6 +849,7 @@ static int hugetlb_cow(struct mm_struct
> {
> struct page *old_page, *new_page;
> int avoidcopy;
> + struct hstate *h = hstate_vma(vma);
>
> old_page = pte_page(pte);
>
> @@ -854,7 +874,7 @@ static int hugetlb_cow(struct mm_struct
> __SetPageUptodate(new_page);
> spin_lock(&mm->page_table_lock);
>
> - ptep = huge_pte_offset(mm, address & HPAGE_MASK);
> + ptep = huge_pte_offset(mm, address & huge_page_mask(h));
> if (likely(pte_same(*ptep, pte))) {
> /* Break COW */
> set_huge_pte_at(mm, address, ptep,
> @@ -876,10 +896,11 @@ static int hugetlb_no_page(struct mm_str
> struct page *page;
> struct address_space *mapping;
> pte_t new_pte;
> + struct hstate *h = hstate_vma(vma);
>
> mapping = vma->vm_file->f_mapping;
> - idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
> - + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
> + idx = ((address - vma->vm_start) >> huge_page_shift(h))
> + + (vma->vm_pgoff >> huge_page_order(h));
>
> /*
> * Use page lock to guard against racing truncation
> @@ -888,7 +909,7 @@ static int hugetlb_no_page(struct mm_str
> retry:
> page = find_lock_page(mapping, idx);
> if (!page) {
> - size = i_size_read(mapping->host) >> HPAGE_SHIFT;
> + size = i_size_read(mapping->host) >> huge_page_shift(h);
> if (idx >= size)
> goto out;
> page = alloc_huge_page(vma, address);
> @@ -896,7 +917,7 @@ retry:
> ret = -PTR_ERR(page);
> goto out;
> }
> - clear_huge_page(page, address);
> + clear_huge_page(page, address, huge_page_size(h));
> __SetPageUptodate(page);
>
> if (vma->vm_flags & VM_SHARED) {
> @@ -912,14 +933,14 @@ retry:
> }
>
> spin_lock(&inode->i_lock);
> - inode->i_blocks += BLOCKS_PER_HUGEPAGE;
> + inode->i_blocks += (huge_page_size(h)) / 512;

Magic number alert. Do you need to replace BLOCKS_PER_HUGEPAGE with a
blocks_per_hugepage(h) helper?

static inline unsigned long blocks_per_hugepage(struct hstate *h)
{
        return huge_page_size(h) / 512;
}

> spin_unlock(&inode->i_lock);
> } else
> lock_page(page);
> }
>
> spin_lock(&mm->page_table_lock);
> - size = i_size_read(mapping->host) >> HPAGE_SHIFT;
> + size = i_size_read(mapping->host) >> huge_page_shift(h);
> if (idx >= size)
> goto backout;
>
> @@ -955,8 +976,9 @@ int hugetlb_fault(struct mm_struct *mm,
> pte_t entry;
> int ret;
> static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> + struct hstate *h = hstate_vma(vma);
>
> - ptep = huge_pte_alloc(mm, address);
> + ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> if (!ptep)
> return VM_FAULT_OOM;
>
> @@ -994,6 +1016,7 @@ int follow_hugetlb_page(struct mm_struct
> unsigned long pfn_offset;
> unsigned long vaddr = *position;
> int remainder = *length;
> + struct hstate *h = hstate_vma(vma);
>
> spin_lock(&mm->page_table_lock);
> while (vaddr < vma->vm_end && remainder) {
> @@ -1005,7 +1028,7 @@ int follow_hugetlb_page(struct mm_struct
> * each hugepage. We have to make * sure we get the
> * first, for the page indexing below to work.
> */
> - pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);
> + pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
>
> if (!pte || pte_none(*pte) || (write && !pte_write(*pte))) {
> int ret;
> @@ -1022,7 +1045,7 @@ int follow_hugetlb_page(struct mm_struct
> break;
> }
>
> - pfn_offset = (vaddr & ~HPAGE_MASK) >> PAGE_SHIFT;
> + pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> page = pte_page(*pte);
> same_page:
> if (pages) {
> @@ -1038,7 +1061,7 @@ same_page:
> --remainder;
> ++i;
> if (vaddr < vma->vm_end && remainder &&
> - pfn_offset < HPAGE_SIZE/PAGE_SIZE) {
> + pfn_offset < (1 << huge_page_order(h))) {

basepages_per_hpage(h)

> /*
> * We use pfn_offset to avoid touching the pageframes
> * of this compound page.
> @@ -1060,13 +1083,14 @@ void hugetlb_change_protection(struct vm
> unsigned long start = address;
> pte_t *ptep;
> pte_t pte;
> + struct hstate *h = hstate_vma(vma);
>
> BUG_ON(address >= end);
> flush_cache_range(vma, address, end);
>
> spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
> spin_lock(&mm->page_table_lock);
> - for (; address < end; address += HPAGE_SIZE) {
> + for (; address < end; address += huge_page_size(h)) {
> ptep = huge_pte_offset(mm, address);
> if (!ptep)
> continue;
> @@ -1205,7 +1229,7 @@ static long region_truncate(struct list_
> return chg;
> }
>
> -static int hugetlb_acct_memory(long delta)
> +static int hugetlb_acct_memory(struct hstate *h, long delta)
> {
> int ret = -ENOMEM;
>
> @@ -1228,18 +1252,18 @@ static int hugetlb_acct_memory(long delt
> * semantics that cpuset has.
> */
> if (delta > 0) {
> - if (gather_surplus_pages(delta) < 0)
> + if (gather_surplus_pages(h, delta) < 0)
> goto out;
>
> - if (delta > cpuset_mems_nr(free_huge_pages_node)) {
> - return_unused_surplus_pages(delta);
> + if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
> + return_unused_surplus_pages(h, delta);
> goto out;
> }
> }
>
> ret = 0;
> if (delta < 0)
> - return_unused_surplus_pages((unsigned long) -delta);
> + return_unused_surplus_pages(h, (unsigned long) -delta);
>
> out:
> spin_unlock(&hugetlb_lock);
> @@ -1249,6 +1273,7 @@ out:
> int hugetlb_reserve_pages(struct inode *inode, long from, long to)
> {
> long ret, chg;
> + struct hstate *h = &global_hstate;
>
> chg = region_chg(&inode->i_mapping->private_list, from, to);
> if (chg < 0)
> @@ -1256,7 +1281,7 @@ int hugetlb_reserve_pages(struct inode *
>
> if (hugetlb_get_quota(inode->i_mapping, chg))
> return -ENOSPC;
> - ret = hugetlb_acct_memory(chg);
> + ret = hugetlb_acct_memory(h, chg);
> if (ret < 0) {
> hugetlb_put_quota(inode->i_mapping, chg);
> return ret;
> @@ -1267,12 +1292,13 @@ int hugetlb_reserve_pages(struct inode *
>
> void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> {
> + struct hstate *h = &global_hstate;
> long chg = region_truncate(&inode->i_mapping->private_list, offset);
>
> spin_lock(&inode->i_lock);
> - inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
> + inode->i_blocks -= ((huge_page_size(h))/512) * freed;
> spin_unlock(&inode->i_lock);
>
> hugetlb_put_quota(inode->i_mapping, (chg - freed));
> - hugetlb_acct_memory(-(chg - freed));
> + hugetlb_acct_memory(h, -(chg - freed));
> }
> Index: linux/arch/powerpc/mm/hugetlbpage.c
> ===================================================================
> --- linux.orig/arch/powerpc/mm/hugetlbpage.c
> +++ linux/arch/powerpc/mm/hugetlbpage.c
> @@ -128,7 +128,7 @@ pte_t *huge_pte_offset(struct mm_struct
> return NULL;
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
> {
> pgd_t *pg;
> pud_t *pu;
> Index: linux/arch/sparc64/mm/hugetlbpage.c
> ===================================================================
> --- linux.orig/arch/sparc64/mm/hugetlbpage.c
> +++ linux/arch/sparc64/mm/hugetlbpage.c
> @@ -195,7 +195,7 @@ hugetlb_get_unmapped_area(struct file *f
> pgoff, flags);
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
> {
> pgd_t *pgd;
> pud_t *pud;
> Index: linux/arch/sh/mm/hugetlbpage.c
> ===================================================================
> --- linux.orig/arch/sh/mm/hugetlbpage.c
> +++ linux/arch/sh/mm/hugetlbpage.c
> @@ -22,7 +22,7 @@
> #include <asm/tlbflush.h>
> #include <asm/cacheflush.h>
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
> {
> pgd_t *pgd;
> pud_t *pud;
> Index: linux/arch/ia64/mm/hugetlbpage.c
> ===================================================================
> --- linux.orig/arch/ia64/mm/hugetlbpage.c
> +++ linux/arch/ia64/mm/hugetlbpage.c
> @@ -24,7 +24,7 @@
> unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;
>
> pte_t *
> -huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
> +huge_pte_alloc (struct mm_struct *mm, unsigned long addr, int sz)
> {
> unsigned long taddr = htlbpage_to_page(addr);
> pgd_t *pgd;
> Index: linux/arch/x86/mm/hugetlbpage.c
> ===================================================================
> --- linux.orig/arch/x86/mm/hugetlbpage.c
> +++ linux/arch/x86/mm/hugetlbpage.c
> @@ -124,7 +124,7 @@ int huge_pmd_unshare(struct mm_struct *m
> return 1;
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz)
> {
> pgd_t *pgd;
> pud_t *pud;
> Index: linux/include/linux/hugetlb.h
> ===================================================================
> --- linux.orig/include/linux/hugetlb.h
> +++ linux/include/linux/hugetlb.h
> @@ -40,7 +40,7 @@ extern int sysctl_hugetlb_shm_group;
>
> /* arch callbacks */
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
> +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, int sz);
> pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
> int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
> struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> @@ -95,7 +95,6 @@ pte_t huge_ptep_get_and_clear(struct mm_
> #else
> void hugetlb_prefault_arch_hook(struct mm_struct *mm);
> #endif
> -

Spurious whitespace change there.

> #else /* !CONFIG_HUGETLB_PAGE */
>
> static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
> @@ -169,8 +168,6 @@ struct file *hugetlb_file_setup(const ch
> int hugetlb_get_quota(struct address_space *mapping, long delta);
> void hugetlb_put_quota(struct address_space *mapping, long delta);
>
> -#define BLOCKS_PER_HUGEPAGE (HPAGE_SIZE / 512)
> -

ah, you remove BLOCKS_PER_HUGEPAGE all right. Just needs to be replaced
with a helper or there will be headscratching over that 512 later.

> static inline int is_file_hugepages(struct file *file)
> {
> if (file->f_op == &hugetlbfs_file_operations)
> @@ -199,4 +196,69 @@ unsigned long hugetlb_get_unmapped_area(
> unsigned long flags);
> #endif /* HAVE_ARCH_HUGETLB_UNMAPPED_AREA */
>
> +#ifdef CONFIG_HUGETLB_PAGE
> +
> +/* Defines one hugetlb page size */
> +struct hstate {
> + int hugetlb_next_nid;
> + short order;
> + /* 2 bytes free */
> + unsigned long mask;
> + unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
> + unsigned long surplus_huge_pages;
> + unsigned long nr_overcommit_huge_pages;
> + struct list_head hugepage_freelists[MAX_NUMNODES];
> + unsigned int nr_huge_pages_node[MAX_NUMNODES];
> + unsigned int free_huge_pages_node[MAX_NUMNODES];
> + unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> +};
> +
> +extern struct hstate global_hstate;
> +
> +static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> +{
> + return &global_hstate;
> +}
> +
> +static inline struct hstate *hstate_file(struct file *f)
> +{
> + return &global_hstate;
> +}
> +
> +static inline struct hstate *hstate_inode(struct inode *i)
> +{
> + return &global_hstate;
> +}
> +
> +static inline unsigned huge_page_size(struct hstate *h)
> +{
> + return PAGE_SIZE << h->order;
> +}
> +
> +static inline unsigned long huge_page_mask(struct hstate *h)
> +{
> + return h->mask;
> +}
> +
> +static inline unsigned long huge_page_order(struct hstate *h)
> +{
> + return h->order;
> +}
> +
> +static inline unsigned huge_page_shift(struct hstate *h)
> +{
> + return h->order + PAGE_SHIFT;
> +}

Typically, you are replacing defines like HPAGE_SIZE with
huge_page_size(). I think it would be an easier change overall if
constants like this were replaced with helpers in lowercase. i.e.

HPAGE_SIZE -> hpage_size(h)
HPAGE_SHIFT -> hpage_shift(h)

etc. It would make parts of this patch easier to read.

> +
> +#else
> +struct hstate {};
> +#define hstate_file(f) NULL
> +#define hstate_vma(v) NULL
> +#define hstate_inode(i) NULL
> +#define huge_page_size(h) PAGE_SIZE
> +#define huge_page_mask(h) PAGE_MASK
> +#define huge_page_order(h) 0
> +#define huge_page_shift(h) PAGE_SHIFT
> +#endif

Is it not typical for #defines like this to be replaced by equivalent
static inlines?
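
Something like this, sketched for two of the stubs (the rest would follow the
same pattern and keep type checking even with CONFIG_HUGETLB_PAGE off):

        static inline struct hstate *hstate_file(struct file *f)
        {
                return NULL;
        }

        static inline unsigned huge_page_size(struct hstate *h)
        {
                return PAGE_SIZE;
        }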

> +
> #endif /* _LINUX_HUGETLB_H */
> Index: linux/fs/hugetlbfs/inode.c
> ===================================================================
> --- linux.orig/fs/hugetlbfs/inode.c
> +++ linux/fs/hugetlbfs/inode.c
> @@ -80,6 +80,7 @@ static int hugetlbfs_file_mmap(struct fi
> struct inode *inode = file->f_path.dentry->d_inode;
> loff_t len, vma_len;
> int ret;
> + struct hstate *h = hstate_file(file);
>
> /*
> * vma address alignment (but not the pgoff alignment) has
> @@ -92,7 +93,7 @@ static int hugetlbfs_file_mmap(struct fi
> vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
> vma->vm_ops = &hugetlb_vm_ops;
>
> - if (vma->vm_pgoff & ~(HPAGE_MASK >> PAGE_SHIFT))
> + if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
> return -EINVAL;
>
> vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> @@ -104,8 +105,8 @@ static int hugetlbfs_file_mmap(struct fi
> len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>
> if (vma->vm_flags & VM_MAYSHARE &&
> - hugetlb_reserve_pages(inode, vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT),
> - len >> HPAGE_SHIFT))
> + hugetlb_reserve_pages(inode, vma->vm_pgoff >> huge_page_order(h),
> + len >> huge_page_shift(h)))
> goto out;
>
> ret = 0;
> @@ -130,8 +131,9 @@ hugetlb_get_unmapped_area(struct file *f
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma;
> unsigned long start_addr;
> + struct hstate *h = hstate_file(file);
>
> - if (len & ~HPAGE_MASK)
> + if (len & ~huge_page_mask(h))
> return -EINVAL;
> if (len > TASK_SIZE)
> return -ENOMEM;
> @@ -143,7 +145,7 @@ hugetlb_get_unmapped_area(struct file *f
> }
>
> if (addr) {
> - addr = ALIGN(addr, HPAGE_SIZE);
> + addr = ALIGN(addr, huge_page_size(h));
> vma = find_vma(mm, addr);
> if (TASK_SIZE - len >= addr &&
> (!vma || addr + len <= vma->vm_start))
> @@ -156,7 +158,7 @@ hugetlb_get_unmapped_area(struct file *f
> start_addr = TASK_UNMAPPED_BASE;
>
> full_search:
> - addr = ALIGN(start_addr, HPAGE_SIZE);
> + addr = ALIGN(start_addr, huge_page_size(h));
>
> for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
> /* At this point: (!vma || addr < vma->vm_end). */
> @@ -174,7 +176,7 @@ full_search:
>
> if (!vma || addr + len <= vma->vm_start)
> return addr;
> - addr = ALIGN(vma->vm_end, HPAGE_SIZE);
> + addr = ALIGN(vma->vm_end, huge_page_size(h));
> }
> }
> #endif
> @@ -225,10 +227,11 @@ hugetlbfs_read_actor(struct page *page,
> static ssize_t hugetlbfs_read(struct file *filp, char __user *buf,
> size_t len, loff_t *ppos)
> {
> + struct hstate *h = hstate_file(filp);
> struct address_space *mapping = filp->f_mapping;
> struct inode *inode = mapping->host;
> - unsigned long index = *ppos >> HPAGE_SHIFT;
> - unsigned long offset = *ppos & ~HPAGE_MASK;
> + unsigned long index = *ppos >> huge_page_shift(h);
> + unsigned long offset = *ppos & ~huge_page_mask(h);
> unsigned long end_index;
> loff_t isize;
> ssize_t retval = 0;
> @@ -243,17 +246,17 @@ static ssize_t hugetlbfs_read(struct fil
> if (!isize)
> goto out;
>
> - end_index = (isize - 1) >> HPAGE_SHIFT;
> + end_index = (isize - 1) >> huge_page_shift(h);
> for (;;) {
> struct page *page;
> int nr, ret;
>
> /* nr is the maximum number of bytes to copy from this page */
> - nr = HPAGE_SIZE;
> + nr = huge_page_size(h);
> if (index >= end_index) {
> if (index > end_index)
> goto out;
> - nr = ((isize - 1) & ~HPAGE_MASK) + 1;
> + nr = ((isize - 1) & ~huge_page_mask(h)) + 1;
> if (nr <= offset) {
> goto out;
> }
> @@ -287,8 +290,8 @@ static ssize_t hugetlbfs_read(struct fil
> offset += ret;
> retval += ret;
> len -= ret;
> - index += offset >> HPAGE_SHIFT;
> - offset &= ~HPAGE_MASK;
> + index += offset >> huge_page_shift(h);
> + offset &= ~huge_page_mask(h);
>
> if (page)
> page_cache_release(page);
> @@ -298,7 +301,7 @@ static ssize_t hugetlbfs_read(struct fil
> break;
> }
> out:
> - *ppos = ((loff_t)index << HPAGE_SHIFT) + offset;
> + *ppos = ((loff_t)index << huge_page_shift(h)) + offset;
> mutex_unlock(&inode->i_mutex);
> return retval;
> }
> @@ -339,8 +342,9 @@ static void truncate_huge_page(struct pa
>
> static void truncate_hugepages(struct inode *inode, loff_t lstart)
> {
> + struct hstate *h = hstate_inode(inode);
> struct address_space *mapping = &inode->i_data;
> - const pgoff_t start = lstart >> HPAGE_SHIFT;
> + const pgoff_t start = lstart >> huge_page_shift(h);
> struct pagevec pvec;
> pgoff_t next;
> int i, freed = 0;
> @@ -449,8 +453,9 @@ static int hugetlb_vmtruncate(struct ino
> {
> pgoff_t pgoff;
> struct address_space *mapping = inode->i_mapping;
> + struct hstate *h = hstate_inode(inode);
>
> - BUG_ON(offset & ~HPAGE_MASK);
> + BUG_ON(offset & ~huge_page_mask(h));
> pgoff = offset >> PAGE_SHIFT;
>
> i_size_write(inode, offset);
> @@ -465,6 +470,7 @@ static int hugetlb_vmtruncate(struct ino
> static int hugetlbfs_setattr(struct dentry *dentry, struct iattr *attr)
> {
> struct inode *inode = dentry->d_inode;
> + struct hstate *h = hstate_inode(inode);
> int error;
> unsigned int ia_valid = attr->ia_valid;
>
> @@ -476,7 +482,7 @@ static int hugetlbfs_setattr(struct dent
>
> if (ia_valid & ATTR_SIZE) {
> error = -EINVAL;
> - if (!(attr->ia_size & ~HPAGE_MASK))
> + if (!(attr->ia_size & ~huge_page_mask(h)))
> error = hugetlb_vmtruncate(inode, attr->ia_size);
> if (error)
> goto out;
> @@ -610,9 +616,10 @@ static int hugetlbfs_set_page_dirty(stru
> static int hugetlbfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> {
> struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(dentry->d_sb);
> + struct hstate *h = hstate_inode(dentry->d_inode);
>
> buf->f_type = HUGETLBFS_MAGIC;
> - buf->f_bsize = HPAGE_SIZE;
> + buf->f_bsize = huge_page_size(h);
> if (sbinfo) {
> spin_lock(&sbinfo->stat_lock);
> /* If no limits set, just report 0 for max/free/used
> Index: linux/ipc/shm.c
> ===================================================================
> --- linux.orig/ipc/shm.c
> +++ linux/ipc/shm.c
> @@ -612,7 +612,8 @@ static void shm_get_stat(struct ipc_name
>
> if (is_file_hugepages(shp->shm_file)) {
> struct address_space *mapping = inode->i_mapping;
> - *rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
> + struct hstate *h = hstate_file(shp->shm_file);
> + *rss += (1 << huge_page_order(h)) * mapping->nrpages;
> } else {
> struct shmem_inode_info *info = SHMEM_I(inode);
> spin_lock(&info->lock);
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c
> +++ linux/mm/memory.c
> @@ -848,7 +848,7 @@ unsigned long unmap_vmas(struct mmu_gath
> if (unlikely(is_vm_hugetlb_page(vma))) {
> unmap_hugepage_range(vma, start, end);
> zap_work -= (end - start) /
> - (HPAGE_SIZE / PAGE_SIZE);
> + (1 << huge_page_order(hstate_vma(vma)));
> start = end;
> } else
> start = unmap_page_range(*tlbp, vma,
> Index: linux/mm/mempolicy.c
> ===================================================================
> --- linux.orig/mm/mempolicy.c
> +++ linux/mm/mempolicy.c
> @@ -1295,7 +1295,8 @@ struct zonelist *huge_zonelist(struct vm
> if (pol->policy == MPOL_INTERLEAVE) {
> unsigned nid;
>
> - nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> + nid = interleave_nid(pol, vma, addr,
> + huge_page_shift(hstate_vma(vma)));
> __mpol_free(pol); /* finished with pol */
> return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
> }
> @@ -1939,9 +1940,12 @@ static void check_huge_range(struct vm_a
> {
> unsigned long addr;
> struct page *page;
> + struct hstate *h = hstate_vma(vma);
> + unsigned sz = huge_page_size(h);
>
> - for (addr = start; addr < end; addr += HPAGE_SIZE) {
> - pte_t *ptep = huge_pte_offset(vma->vm_mm, addr & HPAGE_MASK);
> + for (addr = start; addr < end; addr += sz) {
> + pte_t *ptep = huge_pte_offset(vma->vm_mm,
> + addr & huge_page_mask(h));
> pte_t pte;
>
> if (!ptep)
> Index: linux/mm/mmap.c
> ===================================================================
> --- linux.orig/mm/mmap.c
> +++ linux/mm/mmap.c
> @@ -1793,7 +1793,8 @@ int split_vma(struct mm_struct * mm, str
> struct mempolicy *pol;
> struct vm_area_struct *new;
>
> - if (is_vm_hugetlb_page(vma) && (addr & ~HPAGE_MASK))
> + if (is_vm_hugetlb_page(vma) && (addr &
> + ~(huge_page_mask(hstate_vma(vma)))))
> return -EINVAL;
>
> if (mm->map_count >= sysctl_max_map_count)
>

Overall, despite the nits, I think this is a fairly sensible patch and
something that can be merged and tested in isolation in the interest of
getting 18 patches down to more manageable bites.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-18 12:23:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [2/18] Add basic support for more than one hstate in hugetlbfs

On (17/03/08 02:58), Andi Kleen didst pronounce:
> - Convert hstates to an array
> - Add a first default entry covering the standard huge page size
> - Add functions for architectures to register new hstates
> - Add basic iterators over hstates
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> include/linux/hugetlb.h | 10 +++++++++-
> mm/hugetlb.c | 46 +++++++++++++++++++++++++++++++++++++---------
> 2 files changed, 46 insertions(+), 10 deletions(-)
>
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -27,7 +27,15 @@ unsigned long sysctl_overcommit_huge_pag
> static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
> unsigned long hugepages_treat_as_movable;
>
> -struct hstate global_hstate;
> +static int max_hstate = 1;
> +
> +struct hstate hstates[HUGE_MAX_HSTATE];
> +
> +/* for command line parsing */
> +struct hstate *parsed_hstate __initdata = &global_hstate;
> +

global_hstate becomes a misleading name in this patch. Call it
default_hstate at a minimum.

> +#define for_each_hstate(h) \
> + for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
>
> /*
> * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
> @@ -474,15 +482,11 @@ static struct page *alloc_huge_page(stru
> return page;
> }
>
> -static int __init hugetlb_init(void)
> +static int __init hugetlb_init_hstate(struct hstate *h)
> {
> unsigned long i;
> - struct hstate *h = &global_hstate;
>
> - if (HPAGE_SHIFT == 0)
> - return 0;
> -

Why is there no need for

if (huge_page_shift(h) == 0)
return 0;
?

ah, it's because of what you do to hugetlb_init().

> - if (!h->order) {
> + if (h == &global_hstate && !h->order) {
> h->order = HPAGE_SHIFT - PAGE_SHIFT;
> h->mask = HPAGE_MASK;
> }
> @@ -497,11 +501,34 @@ static int __init hugetlb_init(void)
> break;
> }
> max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> - printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
> +
> + printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
> + h->free_huge_pages,
> + 1 << (h->order + PAGE_SHIFT - 20));

Ah, you partially fix up my whinge from the previous patch here.

page_alloc.c has a helper called K() for conversions. Perhaps move it to
internal.h and add one for M instead of the - 20 here? Not a big deal as
it doesn't take long to figure out.
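
Sketch of what that could look like: K() already exists in mm/page_alloc.c,
M() would be the new addition (and assumes PAGE_SHIFT <= 20):

        #define K(x) ((x) << (PAGE_SHIFT - 10))
        #define M(x) ((x) >> (20 - PAGE_SHIFT))

        /* the "- 20" above would then become M(1UL << h->order) */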

> return 0;
> }
> +
> +static int __init hugetlb_init(void)
> +{
> + if (HPAGE_SHIFT == 0)
> + return 0;
> + return hugetlb_init_hstate(&global_hstate);
> +}
> module_init(hugetlb_init);
>
> +/* Should be called on processing a hugepagesz=... option */
> +void __init huge_add_hstate(unsigned order)
> +{
> + struct hstate *h;
> + BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
> + BUG_ON(order <= HPAGE_SHIFT - PAGE_SHIFT);
> + h = &hstates[max_hstate++];
> + h->order = order;
> + h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
> + hugetlb_init_hstate(h);
> + parsed_hstate = h;
> +}

It's not clear in this patch what parsed_hstate is for as it is not used
elsewhere. I've made a note to check if parsed_hstate makes an unwritten
assumption that there is only "one other" huge page size in the system.

> +
> static int __init hugetlb_setup(char *s)
> {
> if (sscanf(s, "%lu", &max_huge_pages) <= 0)
> Index: linux/include/linux/hugetlb.h
> ===================================================================
> --- linux.orig/include/linux/hugetlb.h
> +++ linux/include/linux/hugetlb.h
> @@ -213,7 +213,15 @@ struct hstate {
> unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> };
>
> -extern struct hstate global_hstate;
> +void __init huge_add_hstate(unsigned order);
> +
> +#ifndef HUGE_MAX_HSTATE
> +#define HUGE_MAX_HSTATE 1
> +#endif
> +
> +extern struct hstate hstates[HUGE_MAX_HSTATE];
> +
> +#define global_hstate (hstates[0])
>
> static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> {
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-18 12:28:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [3/18] Convert /proc output code over to report multiple hstates

On (17/03/08 02:58), Andi Kleen didst pronounce:
> I chose to just report the numbers in a row, in the hope
> to minimze breakage of existing software. The "compat" page size
> is always the first number.
>

Glancing through the libhugetlbfs code, it appears to take the first
value after Hugepagesize: as the "huge pagesize" so I suspect you're
safe there at least FWIW.

> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/hugetlb.c | 59 +++++++++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 39 insertions(+), 20 deletions(-)
>
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -683,37 +683,56 @@ int hugetlb_overcommit_handler(struct ct
>
> #endif /* CONFIG_SYSCTL */
>
> +static int dump_field(char *buf, unsigned field)
> +{
> + int n = 0;
> + struct hstate *h;
> + for_each_hstate (h)
> + n += sprintf(buf + n, " %5lu", *(unsigned long *)((char *)h + field));
> + buf[n++] = '\n';
> + return n;
> +}
> +
> int hugetlb_report_meminfo(char *buf)
> {
> - struct hstate *h = &global_hstate;
> - return sprintf(buf,
> - "HugePages_Total: %5lu\n"
> - "HugePages_Free: %5lu\n"
> - "HugePages_Rsvd: %5lu\n"
> - "HugePages_Surp: %5lu\n"
> - "Hugepagesize: %5lu kB\n",
> - h->nr_huge_pages,
> - h->free_huge_pages,
> - h->resv_huge_pages,
> - h->surplus_huge_pages,
> - 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
> + struct hstate *h;
> + int n = 0;
> + n += sprintf(buf + 0, "HugePages_Total:");
> + n += dump_field(buf + n, offsetof(struct hstate, nr_huge_pages));
> + n += sprintf(buf + n, "HugePages_Free: ");
> + n += dump_field(buf + n, offsetof(struct hstate, free_huge_pages));
> + n += sprintf(buf + n, "HugePages_Rsvd: ");
> + n += dump_field(buf + n, offsetof(struct hstate, resv_huge_pages));
> + n += sprintf(buf + n, "HugePages_Surp: ");
> + n += dump_field(buf + n, offsetof(struct hstate, surplus_huge_pages));
> + n += sprintf(buf + n, "Hugepagesize: ");
> + for_each_hstate (h)
> + n += sprintf(buf + n, " %5u", huge_page_size(h) / 1024);
> + n += sprintf(buf + n, " kB\n");
> + return n;
> }
>
> int hugetlb_report_node_meminfo(int nid, char *buf)
> {
> - struct hstate *h = &global_hstate;
> - return sprintf(buf,
> - "Node %d HugePages_Total: %5u\n"
> - "Node %d HugePages_Free: %5u\n",
> - nid, h->nr_huge_pages_node[nid],
> - nid, h->free_huge_pages_node[nid]);
> + int n = 0;
> + n += sprintf(buf, "Node %d HugePages_Total:", nid);
> + n += dump_field(buf + n, offsetof(struct hstate,
> + nr_huge_pages_node[nid]));
> + n += sprintf(buf + n , "Node %d HugePages_Free: ", nid);
> + n += dump_field(buf + n, offsetof(struct hstate,
> + free_huge_pages_node[nid]));
> + return n;
> }
>
> /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
> unsigned long hugetlb_total_pages(void)
> {
> - struct hstate *h = &global_hstate;
> - return h->nr_huge_pages * (1 << huge_page_order(h));
> + long x = 0;
> + struct hstate *h;
> + for_each_hstate (h) {
> + x += h->nr_huge_pages * (1 << huge_page_order(h));
> + }
> + return x;
> }
>
> /*
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-18 14:08:56

by Adam Litke

[permalink] [raw]
Subject: Re: [PATCH] [6/18] Add support to have individual hstates for each hugetlbfs mount


On Mon, 2008-03-17 at 02:58 +0100, Andi Kleen wrote:
> - Add a new pagesize= option to the hugetlbfs mount that allows setting
> the page size
> - Set up pointers to a suitable hstate for the set page size option
> to the super block and the inode and the vma.
> - Change the hstate accessors to use this information
> - Add code to the hstate init function to set parsed_hstate for command
> line processing
> - Handle duplicated hstate registrations to make the command line user-proof
>
> Signed-off-by: Andi Kleen <[email protected]>

FWIW, I think this approach is definitely the way to go for supporting
multiple huge page sizes.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

2008-03-18 14:11:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [4/18] Add basic support for more than one hstate in hugetlbfs

The leader is missing and the subject is misleading as to what the patch is
doing. I am assuming this is an accident.

On (17/03/08 02:58), Andi Kleen didst pronounce:
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/hugetlb.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -550,26 +550,33 @@ static unsigned int cpuset_mems_nr(unsig
>
> #ifdef CONFIG_SYSCTL
> #ifdef CONFIG_HIGHMEM
> -static void try_to_free_low(unsigned long count)
> +static void do_try_to_free_low(struct hstate *h, unsigned long count)
> {
> - struct hstate *h = &global_hstate;
> int i;
>
> for (i = 0; i < MAX_NUMNODES; ++i) {
> struct page *page, *next;
> struct list_head *freel = &h->hugepage_freelists[i];
> list_for_each_entry_safe(page, next, freel, lru) {
> - if (count >= nr_huge_pages)
> + if (count >= h->nr_huge_pages)
> return;
> if (PageHighMem(page))
> continue;
> list_del(&page->lru);
> - update_and_free_page(page);
> + update_and_free_page(h, page);
> h->free_huge_pages--;
> h->free_huge_pages_node[page_to_nid(page)]--;
> }
> }
> }
> +
> +static void try_to_free_low(unsigned long count)
> +{
> + struct hstate *h;
> + for_each_hstate (h) {
> + do_try_to_free_low(h, count);
> + }
> +}

hmm, so this is freeing 'count' pages from all pools. I doubt that's what
you really want to be doing here. If someone is using the proc entries to
shrink a pool size, I imagine they want to shrink X pages of size Y from a
single pool, not shrink X pages from all pools.

What am I missing?
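
For what it is worth, a minimal sketch of the per-pool variant hinted at here
(assuming the caller passes down the hstate whose pool is being shrunk; this
is not part of the posted series):

/* Sketch: shrink only the pool being resized instead of walking all hstates. */
static void try_to_free_low(struct hstate *h, unsigned long count)
{
        do_try_to_free_low(h, count);
}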

> #else
> static inline void try_to_free_low(unsigned long count)
> {
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 19:31:37

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [7/18] Abstract out the NUMA node round robin code into a separate function

> hmm, I'm not seeing where next_nid gets declared locally here as it
> should have been removed in an earlier patch. Maybe it's reintroduced

No, there was no earlier patch touching this, so the old next_nid
is still there.

-Andi

2008-03-19 19:48:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [5/18] Expand the hugetlbfs sysctls to handle arrays for all hstates

On (18/03/08 17:49), Andi Kleen didst pronounce:
> > Also, offhand it's not super-clear why max_huge_pages is not part of
> > hstate as we only expect one hstate per pagesize anyway.
>
> They need to be a separate array for the sysctl parsing function.
>

D'oh, of course. Pointing that out answers my other questions in relation to
how writing single values to a proc entry affects multiple pools as well. I
was still thinking of max_huge_pages as a single value instead of an array.

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 19:48:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [6/18] Add support to have individual hstates for each hugetlbfs mount

On (17/03/08 02:58), Andi Kleen didst pronounce:
> - Add a new pagesize= option to the hugetlbfs mount that allows setting
> the page size
> - Set up pointers to a suitable hstate for the set page size option
> to the super block and the inode and the vma.
> - Change the hstate accessors to use this information
> - Add code to the hstate init function to set parsed_hstate for command
> line processing
> - Handle duplicated hstate registrations to make the command line user-proof
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> fs/hugetlbfs/inode.c | 50 ++++++++++++++++++++++++++++++++++++++----------
> include/linux/hugetlb.h | 12 ++++++++---
> mm/hugetlb.c | 22 +++++++++++++++++----
> 3 files changed, 67 insertions(+), 17 deletions(-)
>
> Index: linux/include/linux/hugetlb.h
> ===================================================================
> --- linux.orig/include/linux/hugetlb.h
> +++ linux/include/linux/hugetlb.h
> @@ -134,6 +134,7 @@ struct hugetlbfs_config {
> umode_t mode;
> long nr_blocks;
> long nr_inodes;
> + struct hstate *hstate;
> };
>
> struct hugetlbfs_sb_info {
> @@ -142,12 +143,14 @@ struct hugetlbfs_sb_info {
> long max_inodes; /* inodes allowed */
> long free_inodes; /* inodes free */
> spinlock_t stat_lock;
> + struct hstate *hstate;

Minor nit: the other parameters are tabbed out.

> };
>
>
> struct hugetlbfs_inode_info {
> struct shared_policy policy;
> struct inode vfs_inode;
> + struct hstate *hstate;
> };

I'm somewhat surprised it is necessary for the hstate to be on a
per-inode basis when it's already in the hugetlbfs_sb_info. Would
HUGETLBFS_SB(inode->i_sb)->hstate not work?
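
A sketch of that superblock-based accessor (same HUGETLBFS_SB() helper the
filesystem already has; shown only to make the suggestion concrete):

static inline struct hstate *hstate_inode(struct inode *i)
{
        return HUGETLBFS_SB(i->i_sb)->hstate;
}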

>
> static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
> @@ -212,6 +215,7 @@ struct hstate {
> };
>
> void __init huge_add_hstate(unsigned order);
> +struct hstate *huge_lookup_hstate(unsigned long pagesize);
>

lookup_hstate_pagesize() maybe? The name as-is told me nothing about what
it might do. It was the parameter name that gave it away.

> #ifndef HUGE_MAX_HSTATE
> #define HUGE_MAX_HSTATE 1
> @@ -223,17 +227,19 @@ extern struct hstate hstates[HUGE_MAX_HS
>
> static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> {
> - return &global_hstate;
> + return (struct hstate *)vma->vm_private_data;
> }

It does appear that vm_private_data is currently unused and this is safe.

>
> static inline struct hstate *hstate_file(struct file *f)
> {
> - return &global_hstate;
> + struct dentry *d = f->f_dentry;
> + struct inode *i = d->d_inode;
> + return HUGETLBFS_I(i)->hstate;

HUGETLBFS_SB(i->i_sb)->hstate ?

Pretty fugly I'll admit, but it's contained in a helper and keeps the
inode size down.

> }
>
> static inline struct hstate *hstate_inode(struct inode *i)
> {
> - return &global_hstate;
> + return HUGETLBFS_I(i)->hstate;
> }
>
> static inline unsigned huge_page_size(struct hstate *h)
> Index: linux/fs/hugetlbfs/inode.c
> ===================================================================
> --- linux.orig/fs/hugetlbfs/inode.c
> +++ linux/fs/hugetlbfs/inode.c
> @@ -53,6 +53,7 @@ int sysctl_hugetlb_shm_group;
> enum {
> Opt_size, Opt_nr_inodes,
> Opt_mode, Opt_uid, Opt_gid,
> + Opt_pagesize,
> Opt_err,
> };
>
> @@ -62,6 +63,7 @@ static match_table_t tokens = {
> {Opt_mode, "mode=%o"},
> {Opt_uid, "uid=%u"},
> {Opt_gid, "gid=%u"},
> + {Opt_pagesize, "pagesize=%s"},
> {Opt_err, NULL},
> };
>
> @@ -92,6 +94,7 @@ static int hugetlbfs_file_mmap(struct fi
> */
> vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
> vma->vm_ops = &hugetlb_vm_ops;
> + vma->vm_private_data = h;
>
> if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
> return -EINVAL;
> @@ -530,6 +533,7 @@ static struct inode *hugetlbfs_get_inode
> inode->i_op = &page_symlink_inode_operations;
> break;
> }
> + info->hstate = HUGETLBFS_SB(sb)->hstate;
> }
> return inode;
> }
> @@ -750,6 +754,8 @@ hugetlbfs_parse_options(char *options, s
> char *p, *rest;
> substring_t args[MAX_OPT_ARGS];
> int option;
> + unsigned long long size = 0;
> + enum { NO_SIZE, SIZE_STD, SIZE_PERCENT } setsize = NO_SIZE;
>
> if (!options)
> return 0;
> @@ -780,17 +786,13 @@ hugetlbfs_parse_options(char *options, s
> break;
>
> case Opt_size: {
> - unsigned long long size;
> /* memparse() will accept a K/M/G without a digit */
> if (!isdigit(*args[0].from))
> goto bad_val;
> size = memparse(args[0].from, &rest);
> - if (*rest == '%') {
> - size <<= HPAGE_SHIFT;
> - size *= max_huge_pages;
> - do_div(size, 100);
> - }
> - pconfig->nr_blocks = (size >> HPAGE_SHIFT);
> + setsize = SIZE_STD;
> + if (*rest == '%')
> + setsize = SIZE_PERCENT;
> break;
> }
>
> @@ -801,6 +803,19 @@ hugetlbfs_parse_options(char *options, s
> pconfig->nr_inodes = memparse(args[0].from, &rest);
> break;
>
> + case Opt_pagesize: {
> + unsigned long ps;
> + ps = memparse(args[0].from, &rest);
> + pconfig->hstate = huge_lookup_hstate(ps);
> + if (!pconfig->hstate) {
> + printk(KERN_ERR
> + "hugetlbfs: Unsupported page size %lu MB\n",
> + ps >> 20);
> + return -EINVAL;
> + }
> + break;
> + }
> +
> default:
> printk(KERN_ERR "hugetlbfs: Bad mount option: \"%s\"\n",
> p);
> @@ -808,6 +823,18 @@ hugetlbfs_parse_options(char *options, s
> break;
> }
> }
> +
> + /* Do size after hstate is set up */
> + if (setsize > NO_SIZE) {
> + struct hstate *h = pconfig->hstate;
> + if (setsize == SIZE_PERCENT) {
> + size <<= huge_page_shift(h);
> + size *= max_huge_pages[h - hstates];
> + do_div(size, 100);
> + }
> + pconfig->nr_blocks = (size >> huge_page_shift(h));
> + }
> +
> return 0;
>
> bad_val:
> @@ -832,6 +859,7 @@ hugetlbfs_fill_super(struct super_block
> config.uid = current->fsuid;
> config.gid = current->fsgid;
> config.mode = 0755;
> + config.hstate = &global_hstate;
> ret = hugetlbfs_parse_options(data, &config);
> if (ret)
> return ret;
> @@ -840,14 +868,15 @@ hugetlbfs_fill_super(struct super_block
> if (!sbinfo)
> return -ENOMEM;
> sb->s_fs_info = sbinfo;
> + sbinfo->hstate = config.hstate;
> spin_lock_init(&sbinfo->stat_lock);
> sbinfo->max_blocks = config.nr_blocks;
> sbinfo->free_blocks = config.nr_blocks;
> sbinfo->max_inodes = config.nr_inodes;
> sbinfo->free_inodes = config.nr_inodes;
> sb->s_maxbytes = MAX_LFS_FILESIZE;
> - sb->s_blocksize = HPAGE_SIZE;
> - sb->s_blocksize_bits = HPAGE_SHIFT;
> + sb->s_blocksize = huge_page_size(config.hstate);
> + sb->s_blocksize_bits = huge_page_shift(config.hstate);
> sb->s_magic = HUGETLBFS_MAGIC;
> sb->s_op = &hugetlbfs_ops;
> sb->s_time_gran = 1;
> @@ -949,7 +978,8 @@ struct file *hugetlb_file_setup(const ch
> goto out_dentry;
>
> error = -ENOMEM;
> - if (hugetlb_reserve_pages(inode, 0, size >> HPAGE_SHIFT))
> + if (hugetlb_reserve_pages(inode, 0,
> + size >> huge_page_shift(hstate_inode(inode))))
> goto out_inode;
>
> d_instantiate(dentry, inode);
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -143,7 +143,7 @@ static void update_and_free_page(struct
>
> static void free_huge_page(struct page *page)
> {
> - struct hstate *h = &global_hstate;
> + struct hstate *h = huge_lookup_hstate(PAGE_SIZE << compound_order(page));
> int nid = page_to_nid(page);
> struct address_space *mapping;
>
> @@ -519,7 +519,11 @@ module_init(hugetlb_init);
> /* Should be called on processing a hugepagesz=... option */
> void __init huge_add_hstate(unsigned order)
> {
> - struct hstate *h;
> + struct hstate *h = huge_lookup_hstate(PAGE_SIZE << order);
> + if (h) {
> + parsed_hstate = h;
> + return;
> + }
> BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
> BUG_ON(order <= HPAGE_SHIFT - PAGE_SHIFT);
> h = &hstates[max_hstate++];
> @@ -538,6 +542,16 @@ static int __init hugetlb_setup(char *s)
> }
> __setup("hugepages=", hugetlb_setup);
>
> +struct hstate *huge_lookup_hstate(unsigned long pagesize)
> +{
> + struct hstate *h;
> + for_each_hstate (h) {
> + if (huge_page_size(h) == pagesize)
> + return h;
> + }
> + return NULL;
> +}
> +
> static unsigned int cpuset_mems_nr(unsigned int *array)
> {
> int node;
> @@ -1345,7 +1359,7 @@ out:
> int hugetlb_reserve_pages(struct inode *inode, long from, long to)
> {
> long ret, chg;
> - struct hstate *h = &global_hstate;
> + struct hstate *h = hstate_inode(inode);
>
> chg = region_chg(&inode->i_mapping->private_list, from, to);
> if (chg < 0)
> @@ -1364,7 +1378,7 @@ int hugetlb_reserve_pages(struct inode *
>
> void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed)
> {
> - struct hstate *h = &global_hstate;
> + struct hstate *h = hstate_inode(inode);
> long chg = region_truncate(&inode->i_mapping->private_list, offset);
>
> spin_lock(&inode->i_lock);
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 20:57:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [11/18] Fix alignment bug in bootmem allocator

On (17/03/08 02:58), Andi Kleen didst pronounce:
> Without this fix bootmem can return unaligned addresses when the start of a
> node is not aligned to the align value. Needed for reliably allocating
> gigabyte pages.
> Signed-off-by: Andi Kleen <[email protected]>

Seems like something that should be fixed anyway independently of your
patchset. If moved to the start of the set, it can be treated in batch with
the cleanups as well.

>
> ---
> mm/bootmem.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux/mm/bootmem.c
> ===================================================================
> --- linux.orig/mm/bootmem.c
> +++ linux/mm/bootmem.c
> @@ -197,6 +197,7 @@ __alloc_bootmem_core(struct bootmem_data
> {
> unsigned long offset, remaining_size, areasize, preferred;
> unsigned long i, start = 0, incr, eidx, end_pfn;
> + unsigned long pfn;
> void *ret;
>
> if (!size) {
> @@ -239,12 +240,13 @@ __alloc_bootmem_core(struct bootmem_data
> preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
> areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
> incr = align >> PAGE_SHIFT ? : 1;
> + pfn = PFN_DOWN(bdata->node_boot_start);
>

hmm, preferred has already been aligned above and it appears that "offset"
was meant to handle the situation you are dealing with here. Is the caller
passing in "goal" (to avoid DMA32 for example) and messing up how "offset"
is calculated?

> restart_scan:
> for (i = preferred; i < eidx; i += incr) {
> unsigned long j;
> i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
> - i = ALIGN(i, incr);
> + i = ALIGN(pfn + i, incr) - pfn;
> if (i >= eidx)
> break;
> if (test_bit(i, bdata->node_bootmem_map))
>
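
A worked example of the alignment problem with hypothetical numbers (i indexes
the node's bootmem bitmap, so it is relative to the node's first pfn, and incr
is the requested alignment in pages):

unsigned long pfn  = 0x40200;   /* node starts at 1GB + 2MB, not 1GB-aligned */
unsigned long incr = 0x40000;   /* 1GB alignment in 4KB pages */
unsigned long i    = 5;

/* old code: aligns the node-relative index only */
unsigned long before = ALIGN(i, incr);             /* 0x40000 -> absolute pfn 0x80200, unaligned */
/* fixed code: align the absolute pfn, then convert back */
unsigned long after  = ALIGN(pfn + i, incr) - pfn; /* 0x3fe00 -> absolute pfn 0x80000, aligned */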

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 20:56:57

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [13/18] Add support to allocate hugepages of different size with hugepages=...

On (17/03/08 02:58), Andi Kleen didst pronounce:
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> include/linux/hugetlb.h | 1 +
> mm/hugetlb.c | 23 ++++++++++++++++++-----
> 2 files changed, 19 insertions(+), 5 deletions(-)
>
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -552,19 +552,23 @@ static int __init hugetlb_init_hstate(st
> {
> unsigned long i;
>
> - for (i = 0; i < MAX_NUMNODES; ++i)
> - INIT_LIST_HEAD(&h->hugepage_freelists[i]);
> + /* Don't reinitialize lists if they have been already init'ed */
> + if (!h->hugepage_freelists[0].next) {
> + for (i = 0; i < MAX_NUMNODES; ++i)
> + INIT_LIST_HEAD(&h->hugepage_freelists[i]);
>
> - h->hugetlb_next_nid = first_node(node_online_map);
> + h->hugetlb_next_nid = first_node(node_online_map);
> + }


hmm, it's not very clear to me how hugetlb_init_hstate() would get
called twice for the same hstate. Should it be VM_BUG_ON() if a hstate
gets initialised twice instead?

>
> - for (i = 0; i < max_huge_pages[h - hstates]; ++i) {
> + while (h->parsed_hugepages < max_huge_pages[h - hstates]) {
> if (h->order > MAX_ORDER) {
> if (!alloc_bm_huge_page(h))
> break;
> } else if (!alloc_fresh_huge_page(h))
> break;
> + h->parsed_hugepages++;
> }
> - max_huge_pages[h - hstates] = h->free_huge_pages = h->nr_huge_pages = i;
> + max_huge_pages[h - hstates] = h->parsed_hugepages;
>
> printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
> h->free_huge_pages,
> @@ -602,6 +606,15 @@ static int __init hugetlb_setup(char *s)
> unsigned long *mhp = &max_huge_pages[parsed_hstate - hstates];
> if (sscanf(s, "%lu", mhp) <= 0)
> *mhp = 0;
> + /*
> + * Global state is always initialized later in hugetlb_init.
> + * But we need to allocate > MAX_ORDER hstates here early to still
> + * use the bootmem allocator.
> + * If you add additional hstates <= MAX_ORDER you'll need
> + * to fix that.
> + */
> + if (parsed_hstate != &global_hstate)
> + hugetlb_init_hstate(parsed_hstate);
> return 1;
> }
> __setup("hugepages=", hugetlb_setup);
> Index: linux/include/linux/hugetlb.h
> ===================================================================
> --- linux.orig/include/linux/hugetlb.h
> +++ linux/include/linux/hugetlb.h
> @@ -212,6 +212,7 @@ struct hstate {
> unsigned int nr_huge_pages_node[MAX_NUMNODES];
> unsigned int free_huge_pages_node[MAX_NUMNODES];
> unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> + unsigned long parsed_hugepages;
> };
>
> void __init huge_add_hstate(unsigned order);
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 20:59:00

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [12/18] Add support to allocate hugetlb pages that are larger than MAX_ORDER

On (17/03/08 02:58), Andi Kleen didst pronounce:
> This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
> not practical to enlarge MAX_ORDER to 1GB.
>
> Instead the 1GB pages are only allocated at boot using the bootmem
> allocator using the hugepages=... option.
>
> These 1G bootmem pages are never freed. In theory it would be possible
> to implement that with some complications, but since it would be a one-way
> street (> MAX_ORDER pages cannot be allocated later) I decided not to for now.
>
> The > MAX_ORDER code is not ifdef'ed per architecture. It is not very big
> and the ifdef ugliness did not seem worth it.
>
> Known problems: /proc/meminfo and "free" do not display the memory
> allocated for gb pages in "Total". This is a little confusing for the
> user.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/hugetlb.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 62 insertions(+), 2 deletions(-)
>
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -14,6 +14,7 @@
> #include <linux/mempolicy.h>
> #include <linux/cpuset.h>
> #include <linux/mutex.h>
> +#include <linux/bootmem.h>
>
> #include <asm/page.h>
> #include <asm/pgtable.h>
> @@ -153,7 +154,7 @@ static void free_huge_page(struct page *
> INIT_LIST_HEAD(&page->lru);
>
> spin_lock(&hugetlb_lock);
> - if (h->surplus_huge_pages_node[nid]) {
> + if (h->surplus_huge_pages_node[nid] && h->order <= MAX_ORDER) {
> update_and_free_page(h, page);
> h->surplus_huge_pages--;
> h->surplus_huge_pages_node[nid]--;
> @@ -215,6 +216,9 @@ static struct page *alloc_fresh_huge_pag
> {
> struct page *page;
>
> + if (h->order > MAX_ORDER)
> + return NULL;
> +

Should this print out a KERN_INFO message to the effect that pages of
that size must be reserved at boot-time?

> page = alloc_pages_node(nid,
> htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> huge_page_order(h));
> @@ -271,6 +275,9 @@ static struct page *alloc_buddy_huge_pag
> struct page *page;
> unsigned int nid;
>
> + if (h->order > MAX_ORDER)
> + return NULL;
> +
> /*
> * Assume we will successfully allocate the surplus page to
> * prevent racing processes from causing the surplus to exceed
> @@ -422,6 +429,10 @@ return_unused_surplus_pages(struct hstat
> /* Uncommit the reservation */
> h->resv_huge_pages -= unused_resv_pages;
>
> + /* Cannot return gigantic pages currently */
> + if (h->order > MAX_ORDER)
> + return;
> +
> nr_pages = min(unused_resv_pages, h->surplus_huge_pages);
>
> while (nr_pages) {
> @@ -499,6 +510,44 @@ static struct page *alloc_huge_page(stru
> return page;
> }
>
> +static __initdata LIST_HEAD(huge_boot_pages);
> +
> +struct huge_bm_page {
> + struct list_head list;
> + struct hstate *hstate;
> +};
> +
> +static int __init alloc_bm_huge_page(struct hstate *h)
> +{
> + struct huge_bm_page *m;
> + m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
> + huge_page_size(h), huge_page_size(h),
> + 0);
> + if (!m)
> + return 0;
> + BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
> + /* Put them into a private list first because mem_map is not up yet */
> + list_add(&m->list, &huge_boot_pages);
> + m->hstate = h;
> + huge_next_node(h);
> + return 1;
> +}
> +
> +/* Put bootmem huge pages into the standard lists after mem_map is up */
> +static int __init huge_init_bm(void)
> +{
> + struct huge_bm_page *m;
> + list_for_each_entry (m, &huge_boot_pages, list) {
> + struct page *page = virt_to_page(m);
> + struct hstate *h = m->hstate;
> + __ClearPageReserved(page);
> + prep_compound_page(page, h->order);
> + huge_new_page(h, page);
> + }
> + return 0;
> +}
> +__initcall(huge_init_bm);
> +
> static int __init hugetlb_init_hstate(struct hstate *h)
> {
> unsigned long i;
> @@ -509,7 +558,10 @@ static int __init hugetlb_init_hstate(st
> h->hugetlb_next_nid = first_node(node_online_map);
>
> for (i = 0; i < max_huge_pages[h - hstates]; ++i) {
> - if (!alloc_fresh_huge_page(h))
> + if (h->order > MAX_ORDER) {
> + if (!alloc_bm_huge_page(h))
> + break;
> + } else if (!alloc_fresh_huge_page(h))
> break;
> }
> max_huge_pages[h - hstates] = h->free_huge_pages = h->nr_huge_pages = i;
> @@ -581,6 +633,9 @@ static void do_try_to_free_low(struct hs
> {
> int i;
>
> + if (h->order > MAX_ORDER)
> + return;
> +
> for (i = 0; i < MAX_NUMNODES; ++i) {
> struct page *page, *next;
> struct list_head *freel = &h->hugepage_freelists[i];
> @@ -618,6 +673,11 @@ set_max_huge_pages(struct hstate *h, uns
>
> *err = 0;
>
> + if (h->order > MAX_ORDER) {
> + *err = -EINVAL;
> + return max_huge_pages[h - hstates];
> + }
> +

Ah, scratch the comment on an earlier patch where I said I cannot see
where err ever gets updated in set_max_huge_pages().

> /*
> * Increase the pool size
> * First take pages out of surplus state. Then make up the
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 20:58:14

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [7/18] Abstract out the NUMA node round robin code into a separate function

On (17/03/08 02:58), Andi Kleen didst pronounce:
> Need this as a separate function for a future patch.
>
> No behaviour change.
>
> Signed-off-by: Andi Kleen <[email protected]>

Maybe if you moved this beside patch 1, they could both be tested in
isolation as a fairly reasonable cleanup that does not alter
functionality? Not a big deal.

>
> ---
> mm/hugetlb.c | 37 ++++++++++++++++++++++---------------
> 1 file changed, 22 insertions(+), 15 deletions(-)
>
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -219,6 +219,27 @@ static struct page *alloc_fresh_huge_pag
> return page;
> }
>
> +/*
> + * Use a helper variable to find the next node and then
> + * copy it back to hugetlb_next_nid afterwards:
> + * otherwise there's a window in which a racer might
> + * pass invalid nid MAX_NUMNODES to alloc_pages_node.
> + * But we don't need to use a spin_lock here: it really
> + * doesn't matter if occasionally a racer chooses the
> + * same nid as we do. Move nid forward in the mask even
> + * if we just successfully allocated a hugepage so that
> + * the next caller gets hugepages on the next node.
> + */
> +static int huge_next_node(struct hstate *h)
> +{
> + int next_nid;
> + next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> + if (next_nid == MAX_NUMNODES)
> + next_nid = first_node(node_online_map);
> + h->hugetlb_next_nid = next_nid;
> + return next_nid;
> +}
> +
> static int alloc_fresh_huge_page(struct hstate *h)
> {
> struct page *page;
> @@ -232,21 +253,7 @@ static int alloc_fresh_huge_page(struct
> page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
> if (page)
> ret = 1;
> - /*
> - * Use a helper variable to find the next node and then
> - * copy it back to hugetlb_next_nid afterwards:
> - * otherwise there's a window in which a racer might
> - * pass invalid nid MAX_NUMNODES to alloc_pages_node.
> - * But we don't need to use a spin_lock here: it really
> - * doesn't matter if occasionally a racer chooses the
> - * same nid as we do. Move nid forward in the mask even
> - * if we just successfully allocated a hugepage so that
> - * the next caller gets hugepages on the next node.
> - */
> - next_nid = next_node(h->hugetlb_next_nid, node_online_map);
> - if (next_nid == MAX_NUMNODES)
> - next_nid = first_node(node_online_map);
> - h->hugetlb_next_nid = next_nid;
> + next_nid = huge_next_node(h);

hmm, I'm not seeing where next_nid gets declared locally here as it
should have been removed in an earlier patch. Maybe it's reintroduced
later but if you do reshuffle the patchset so that the cleanups can be
merged on their own, it'll show up in a compile test.

> } while (!page && h->hugetlb_next_nid != start_nid);
>
> return ret;
>

Other than the possible gotcha with next_nid declared locally, the move
seems fine.

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 20:58:36

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [7/18] Abstract out the NUMA node round robin code into a separate function

On (18/03/08 16:47), Andi Kleen didst pronounce:
> > hmm, I'm not seeing where next_nid gets declared locally here as it
> > should have been removed in an earlier patch. Maybe it's reintroduced
>
> No, there was no earlier patch touching this, so the old next_nid
> is still there.
>

ah yes, my bad. I thought it went away in patch 1/18.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 20:56:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [10/18] Factor out new huge page preparation code into separate function

On (17/03/08 02:58), Andi Kleen didst pronounce:
> Needed to avoid code duplication in follow up patches.
>
> This happens to fix a minor bug. When alloc_bootmem_node falls back and
> returns memory on a different node than the one passed in, the old code
> would have put it into the free lists of the wrong node.
> Now it ends up in the freelist of the correct node.
>

It fixes a real bug for sure. It may be possible with that bug to leak
pages onto a linked list with bogus counters.

Possibly another candidate patch to move to the start of the series so
they can be merged and tested separately?

> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/hugetlb.c | 21 +++++++++++++--------
> 1 file changed, 13 insertions(+), 8 deletions(-)
>
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -200,6 +200,17 @@ static int adjust_pool_surplus(struct hs
> return ret;
> }
>
> +static void huge_new_page(struct hstate *h, struct page *page)
> +{

prep_new_huge_page() as it has a similar responsibility to
prep_new_page() ? Just at a glance, huge_new_page() implies to me that
it calls alloc_pages_node()

> + unsigned nid = pfn_to_nid(page_to_pfn(page));
> + set_compound_page_dtor(page, free_huge_page);
> + spin_lock(&hugetlb_lock);
> + h->nr_huge_pages++;
> + h->nr_huge_pages_node[nid]++;
> + spin_unlock(&hugetlb_lock);
> + put_page(page); /* free it into the hugepage allocator */
> +}
> +
> static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
> {
> struct page *page;
> @@ -207,14 +218,8 @@ static struct page *alloc_fresh_huge_pag
> page = alloc_pages_node(nid,
> htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
> huge_page_order(h));
> - if (page) {
> - set_compound_page_dtor(page, free_huge_page);
> - spin_lock(&hugetlb_lock);
> - h->nr_huge_pages++;
> - h->nr_huge_pages_node[nid]++;
> - spin_unlock(&hugetlb_lock);
> - put_page(page); /* free it into the hugepage allocator */
> - }
> + if (page)
> + huge_new_page(h, page);
>
> return page;
> }
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 21:56:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [8/18] Add a __alloc_bootmem_node_nopanic

On (17/03/08 02:58), Andi Kleen didst pronounce:
> Straight forward variant of the existing __alloc_bootmem_node, only
> Signed-off-by: Andi Kleen <[email protected]>
>
> difference is that it doesn't panic on failure

Seems to be a bit of cut&paste jumbling there.

>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> include/linux/bootmem.h | 4 ++++
> mm/bootmem.c | 12 ++++++++++++
> 2 files changed, 16 insertions(+)
>
> Index: linux/mm/bootmem.c
> ===================================================================
> --- linux.orig/mm/bootmem.c
> +++ linux/mm/bootmem.c
> @@ -471,6 +471,18 @@ void * __init __alloc_bootmem_node(pg_da
> return __alloc_bootmem(size, align, goal);
> }
>
> +void * __init __alloc_bootmem_node_nopanic(pg_data_t *pgdat, unsigned long size,
> + unsigned long align, unsigned long goal)
> +{
> + void *ptr;
> +
> + ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, 0);
> + if (ptr)
> + return ptr;
> +
> + return __alloc_bootmem_nopanic(size, align, goal);
> +}

Straight-forward. Mildly irritating that there are multiple variants that
only differ by whether they panic on allocation failure or not. Probably
should be a separate removal of the duplicated code there, but that is outside
the scope of this patch.
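
A sketch of that consolidation (hypothetical, not part of this patch): one
core helper built on the same calls, with the panic behaviour selected by
thin wrappers:

static void * __init ___alloc_bootmem_node(pg_data_t *pgdat, unsigned long size,
                                           unsigned long align, unsigned long goal,
                                           int panic_on_fail)
{
        void *ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, 0);

        if (ptr)
                return ptr;
        /* fall back to any node; __alloc_bootmem() panics, the _nopanic one does not */
        return panic_on_fail ? __alloc_bootmem(size, align, goal)
                             : __alloc_bootmem_nopanic(size, align, goal);
}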

> +
> #ifndef ARCH_LOW_ADDRESS_LIMIT
> #define ARCH_LOW_ADDRESS_LIMIT 0xffffffffUL
> #endif
> Index: linux/include/linux/bootmem.h
> ===================================================================
> --- linux.orig/include/linux/bootmem.h
> +++ linux/include/linux/bootmem.h
> @@ -90,6 +90,10 @@ extern void *__alloc_bootmem_node(pg_dat
> unsigned long size,
> unsigned long align,
> unsigned long goal);
> +extern void *__alloc_bootmem_node_nopanic(pg_data_t *pgdat,
> + unsigned long size,
> + unsigned long align,
> + unsigned long goal);
> extern unsigned long init_bootmem_node(pg_data_t *pgdat,
> unsigned long freepfn,
> unsigned long startpfn,
>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 21:57:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [13/18] Add support to allocate hugepages of different size with hugepages=...

On (18/03/08 17:45), Andi Kleen didst pronounce:
> > hmm, it's not very clear to me how hugetlb_init_hstate() would get
> > called twice for the same hstate. Should it be VM_BUG_ON() if a hstate
>
> It is called from a __setup function and the user can specify them multiple
> times. Also, when the user has already specified HPAGE_SIZE and it has been
> set up, it should not be called again.
>

Ok, that is a fair explanation. Thanks.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 21:58:08

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [5/18] Expand the hugetlbfs sysctls to handle arrays for all hstates

On (17/03/08 02:58), Andi Kleen didst pronounce:
> - I didn't bother with hugetlb_shm_group and treat_as_movable,
> these are still single global.

I cannot imagine why either of those would be per-pool anyway.
Potentially shm_group could become a per-mount value, which is both
outside the scope of this patchset and not per-pool, so it is unsuitable
for hstate.

> - Also improve error propagation for the sysctl handlers a bit
>
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> include/linux/hugetlb.h | 5 +++--
> kernel/sysctl.c | 2 +-
> mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++------------
> 3 files changed, 35 insertions(+), 15 deletions(-)
>
> Index: linux/include/linux/hugetlb.h
> ===================================================================
> --- linux.orig/include/linux/hugetlb.h
> +++ linux/include/linux/hugetlb.h
> @@ -32,8 +32,6 @@ int hugetlb_fault(struct mm_struct *mm,
> int hugetlb_reserve_pages(struct inode *inode, long from, long to);
> void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
>
> -extern unsigned long max_huge_pages;
> -extern unsigned long sysctl_overcommit_huge_pages;
> extern unsigned long hugepages_treat_as_movable;
> extern const unsigned long hugetlb_zero, hugetlb_infinity;
> extern int sysctl_hugetlb_shm_group;
> @@ -258,6 +256,9 @@ static inline unsigned huge_page_shift(s
> return h->order + PAGE_SHIFT;
> }
>
> +extern unsigned long max_huge_pages[HUGE_MAX_HSTATE];
> +extern unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];

Any particular reason for moving them?

Also, offhand it's not super-clear why max_huge_pages is not part of
hstate as we only expect one hstate per pagesize anyway.

> +
> #else
> struct hstate {};
> #define hstate_file(f) NULL
> Index: linux/kernel/sysctl.c
> ===================================================================
> --- linux.orig/kernel/sysctl.c
> +++ linux/kernel/sysctl.c
> @@ -935,7 +935,7 @@ static struct ctl_table vm_table[] = {
> {
> .procname = "nr_hugepages",
> .data = &max_huge_pages,
> - .maxlen = sizeof(unsigned long),
> + .maxlen = sizeof(max_huge_pages),
> .mode = 0644,
> .proc_handler = &hugetlb_sysctl_handler,
> .extra1 = (void *)&hugetlb_zero,
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -22,8 +22,8 @@
> #include "internal.h"
>
> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> -unsigned long max_huge_pages;
> -unsigned long sysctl_overcommit_huge_pages;
> +unsigned long max_huge_pages[HUGE_MAX_HSTATE];
> +unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
> static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
> unsigned long hugepages_treat_as_movable;
>
> @@ -496,11 +496,11 @@ static int __init hugetlb_init_hstate(st
>
> h->hugetlb_next_nid = first_node(node_online_map);
>
> - for (i = 0; i < max_huge_pages; ++i) {
> + for (i = 0; i < max_huge_pages[h - hstates]; ++i) {
> if (!alloc_fresh_huge_page(h))
> break;
> }
> - max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> + max_huge_pages[h - hstates] = h->free_huge_pages = h->nr_huge_pages = i;
>

hmm ok, it looks a little weird to be working out h - hstates multiple times
in a loop when it is invariant, but functionally it's fine.
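
A sketch of that cleanup (hstate_index() is a hypothetical helper, not in the
posted patch):

static inline int hstate_index(struct hstate *h)
{
        return h - hstates;
}

hugetlb_init_hstate() could then compute hstate_index(h) once into a local
variable and use it for both the loop bound and the final max_huge_pages
assignment.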

> printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
> h->free_huge_pages,
> @@ -531,8 +531,9 @@ void __init huge_add_hstate(unsigned ord
>
> static int __init hugetlb_setup(char *s)
> {
> - if (sscanf(s, "%lu", &max_huge_pages) <= 0)
> - max_huge_pages = 0;
> + unsigned long *mhp = &max_huge_pages[parsed_hstate - hstates];

This looks like we are assuming there is only ever one other
parsed_hstate. For the purposes of what you aim to achieve in this set,
it's not important but a comment over parsed_hstate about this
assumption is probably necessary.
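
Something along these lines, perhaps (wording is a sketch; the default of
&global_hstate is inferred from the check added in patch 13):

/*
 * parsed_hstate points at the hstate set up by the most recent
 * "hugepagesz=" option; a following "hugepages=" applies to it.
 * With no "hugepagesz=" given it stays at &global_hstate, so a bare
 * "hugepages=" keeps configuring the default huge page size.
 */
static struct hstate *parsed_hstate = &global_hstate;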

> + if (sscanf(s, "%lu", mhp) <= 0)
> + *mhp = 0;
> return 1;
> }
> __setup("hugepages=", hugetlb_setup);
> @@ -584,10 +585,12 @@ static inline void try_to_free_low(unsig
> #endif
>
> #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static unsigned long set_max_huge_pages(unsigned long count)
> +static unsigned long
> +set_max_huge_pages(struct hstate *h, unsigned long count, int *err)
> {
> unsigned long min_count, ret;
> - struct hstate *h = &global_hstate;
> +
> + *err = 0;
>

What is updating err to anything else in set_max_huge_pages()?

> /*
> * Increase the pool size
> @@ -659,8 +662,20 @@ int hugetlb_sysctl_handler(struct ctl_ta
> struct file *file, void __user *buffer,
> size_t *length, loff_t *ppos)
> {
> - proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> - max_huge_pages = set_max_huge_pages(max_huge_pages);
> + int err = 0;
> + struct hstate *h;
> + int i;
> + err = proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> + if (err)
> + return err;
> + i = 0;
> + for_each_hstate (h) {
> + max_huge_pages[i] = set_max_huge_pages(h, max_huge_pages[i],
> + &err);

hmm, this is saying when I write 10 to nr_hugepages, I am asking for 10
2MB pages and 10 1GB pages potentially. Is that what you want?

> + if (err)
> + return err;

I'm failing to see how the error handling is improved when
set_max_huge_pages() is not updating err. Maybe it happens in another
patch.

> + i++;
> + }
> return 0;
> }
>
> @@ -680,10 +695,14 @@ int hugetlb_overcommit_handler(struct ct
> struct file *file, void __user *buffer,
> size_t *length, loff_t *ppos)
> {
> - struct hstate *h = &global_hstate;
> + struct hstate *h;
> + int i = 0;
> proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> spin_lock(&hugetlb_lock);
> - h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages;
> + for_each_hstate (h) {
> + h->nr_overcommit_huge_pages = sysctl_overcommit_huge_pages[i];
> + i++;
> + }

Similar to the other sysctl here, the overcommit value is being set for
all the huge page sizes.

> spin_unlock(&hugetlb_lock);
> return 0;
> }
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-19 22:05:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [5/18] Expand the hugetlbfs sysctls to handle arrays for all hstates

> Also, offhand it's not super-clear why max_huge_pages is not part of
> hstate as we only expect one hstate per pagesize anyway.

They need to be a separate array for the sysctl parsing function.

-Andi

2008-03-19 22:06:25

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [13/18] Add support to allocate hugepages of different size with hugepages=...

> hmm, it's not very clear to me how hugetlb_init_hstate() would get
> called twice for the same hstate. Should it be VM_BUG_ON() if a hstate

It is called from a __setup function and the user can specify them multiple
times. Also, when the user has already specified HPAGE_SIZE and it has been
set up, it should not be called again.

-Andi

2008-03-19 22:13:15

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] [14/18] Clean up hugetlb boot time printk

On (17/03/08 02:58), Andi Kleen didst pronounce:
> - Reword sentence to clarify meaning with multiple options
> - Add support for using GB prefixes for the page size
> - Add extra printk to delayed > MAX_ORDER allocation code
>

Scratch earlier comments about this printk. If the printk fix
was broken out, it could be moved to the start of the set so it can be
tested/merged separately. The remainder of this patch could then be
folded into the patch allowing 1GB pages to be reserved at boot-time.

> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> mm/hugetlb.c | 33 ++++++++++++++++++++++++++++++---
> 1 file changed, 30 insertions(+), 3 deletions(-)
>
> Index: linux/mm/hugetlb.c
> ===================================================================
> --- linux.orig/mm/hugetlb.c
> +++ linux/mm/hugetlb.c
> @@ -510,6 +510,15 @@ static struct page *alloc_huge_page(stru
> return page;
> }
>
> +static __init char *memfmt(char *buf, unsigned long n)
> +{
> + if (n >= (1UL << 30))
> + sprintf(buf, "%lu GB", n >> 30);
> + else
> + sprintf(buf, "%lu MB", n >> 20);
> + return buf;
> +}
> +
> static __initdata LIST_HEAD(huge_boot_pages);
>
> struct huge_bm_page {
> @@ -536,14 +545,28 @@ static int __init alloc_bm_huge_page(str
> /* Put bootmem huge pages into the standard lists after mem_map is up */
> static int __init huge_init_bm(void)
> {
> + unsigned long pages = 0;
> struct huge_bm_page *m;
> + struct hstate *h = NULL;
> + char buf[32];
> +
> list_for_each_entry (m, &huge_boot_pages, list) {
> struct page *page = virt_to_page(m);
> - struct hstate *h = m->hstate;
> + h = m->hstate;
> __ClearPageReserved(page);
> prep_compound_page(page, h->order);
> huge_new_page(h, page);
> + pages++;
> }
> +
> + /*
> + * This only prints for a single hstate. This works for x86-64,
> + * but if you do multiple > MAX_ORDER hstates you'll need to fix it.
> + */
> + if (pages > 0)
> + printk(KERN_INFO "HugeTLB pre-allocated %ld %s pages\n",
> + h->free_huge_pages,
> + memfmt(buf, huge_page_size(h)));
> return 0;
> }
> __initcall(huge_init_bm);
> @@ -551,6 +574,8 @@ __initcall(huge_init_bm);
> static int __init hugetlb_init_hstate(struct hstate *h)
> {
> unsigned long i;
> + char buf[32];
> + unsigned long pages = 0;
>
> /* Don't reinitialize lists if they have been already init'ed */
> if (!h->hugepage_freelists[0].next) {
> @@ -567,12 +592,14 @@ static int __init hugetlb_init_hstate(st
> } else if (!alloc_fresh_huge_page(h))
> break;
> h->parsed_hugepages++;
> + pages++;
> }
> max_huge_pages[h - hstates] = h->parsed_hugepages;
>
> - printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
> + if (pages > 0)
> + printk(KERN_INFO "HugeTLB pre-allocated %ld %s pages\n",
> h->free_huge_pages,
> - 1 << (h->order + PAGE_SHIFT - 20));
> + memfmt(buf, huge_page_size(h)));
> return 0;
> }
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-03-23 10:39:00

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] [2/18] Add basic support for more than one hstate in hugetlbfs

Hi Andi

sorry for very late review.

> @@ -497,11 +501,34 @@ static int __init hugetlb_init(void)
> break;
> }
> max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
> - printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
> +
> + printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
> + h->free_huge_pages,
> + 1 << (h->order + PAGE_SHIFT - 20));
> return 0;
> }

The IA64 arch supports 64k huge pages, so the assumption of a >1MB size is wrong.
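
A sketch of how memfmt() from patch 14 (quoted earlier in the thread) could be
extended so sub-1MB sizes like IA64's 64KB print correctly (the KB branch is
not in the posted patches):

static __init char *memfmt(char *buf, unsigned long n)
{
        if (n >= (1UL << 30))
                sprintf(buf, "%lu GB", n >> 30);
        else if (n >= (1UL << 20))
                sprintf(buf, "%lu MB", n >> 20);
        else
                sprintf(buf, "%lu KB", n >> 10);
        return buf;
}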


> +/* Should be called on processing a hugepagesz=... option */
> +void __init huge_add_hstate(unsigned order)
> +{
> + struct hstate *h;
> + BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
> + BUG_ON(order <= HPAGE_SHIFT - PAGE_SHIFT);
> + h = &hstates[max_hstate++];
> + h->order = order;
> + h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
> + hugetlb_init_hstate(h);
> + parsed_hstate = h;
> +}

This function is called once per boot parameter, right?
If so, this function causes a panic when a stupid user writes many
hugepagesz boot parameters.

Why don't you use the following check?

	if (max_hstate >= HUGE_MAX_HSTATE) {
		printk("hoge hoge");
		return;
	}



- kosaki

2008-03-23 11:25:42

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [2/18] Add basic support for more than one hstate in hugetlbfs

> this function is called once by one boot parameter, right?
> if so, this function cause panic when stupid user write many
> hugepagesz boot parameter.

A later patch fixes that up by looking up the hstate explicitly. Also, it
is bisect-safe because the callers are only added later.

-Andi

2008-03-23 11:30:24

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] [2/18] Add basic support for more than one hstate in hugetlbfs

> > this function is called once by one boot parameter, right?
> > if so, this function cause panic when stupid user write many
> > hugepagesz boot parameter.
>
> A later patch fixes that up by looking up the hstate explicitely. Also it
> is bisect safe because the callers are only added later.

Oops, sorry.