2004-10-22 05:04:10

by Christoph Lameter

[permalink] [raw]
Subject: Hugepages demand paging V1 [0/4]: Discussion and overview

This is a revised edition of the hugetlb demand paging patches by
Kenneth Chen which were discussed in the following thread in August 2004:

http://marc.theaimsgroup.com/?t=109171285000004&r=1&w=2

The initial post by Ken was in April in

http://marc.theaimsgroup.com/?l=linux-ia64&m=108189860401704&w=2

Hugetlb demand paging has been part of the production release of SuSE SLES
9 for a while now (used by such products as Oracle) and this patchset is
intended to help hugetlb demand paging also get into the official Linux kernel.

This first version of the patchset is a collection of the patches from the
above-mentioned thread in August. The key unresolved issue in that thread was
the necessity of calling update_mmu_cache after setting up a huge pte.

update_mmu_cache is intended to update the mmu cache for a PAGESIZE page and
not for a huge page. The solution adopted here (as already suggested as
a possible solution in that thread) is to extend the semantics of
set_huge_pte(): set_huge_pte() must also do for huge pages what
update_mmu_cache does for PAGESIZE pages. For that purpose an additional
address parameter was added to set_huge_pte(), which will conveniently break
any old code. The included patch hopefully already fixes all occurrences
of set_huge_pte().
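
For illustration, the extended interface then looks roughly like this (a
minimal sketch using the names from the patches below; the per-arch bodies
differ):

	/* The faulting address is now passed in so that the arch code can
	 * also perform whatever update_mmu_cache() does for PAGESIZE ptes. */
	void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
			  struct page *page, pte_t *page_table, int write_access,
			  unsigned long address);

	/* Caller side, from the generic fault handler in patch 1/4: */
	set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE, addr);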

A linux-2.6.9-bk5 kernel with this patchset was built on IA64 and successfully
tested using a performance test program for hugetlb pages.

Note that this is the first version of the patchset and is meant as a basis
for discussion, not as a final patchset. Please review these patches.

The patchset consists of 4 patches.

1/4 Demand Paging patch. This is the base and is mostly Ken's original work
plus a fix that was posted later.

2/4 set_huge_pte update. This updates the set_huge_pte function for all
architectures and ensures that the arch-specific action of
update_mmu_cache is taken (which may be to do nothing for some arches).
Please verify that this really addresses the issues for each arch
and that it is complete.

3/4 Overcommit patch: Mostly the original work by Ken plus a fix that he
posted later.

4/4 Numa patch: Work by Raymund Bryant and myself at SGI to make
the huge page allocator try to allocate local memory instead of always
starting at node zero. This definitely needs to be more sophisticated.

Patches 1 to 3 must be applied together. The Numa patch is optional.


2004-10-22 05:07:28

by Christoph Lameter

[permalink] [raw]
Subject: Hugepages demand paging V1 [2/4]: set_huge_pte() arch updates

Changelog
* Update set_huge_pte throughout all arches
* set_huge_pte has an additional address argument
* set_huge_pte must also do what update_mmu_cache typically does
for PAGESIZE ptes.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/hugetlbpage.c 2004-10-21 20:17:44.000000000 -0700
@@ -57,7 +57,8 @@
#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)

void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
- struct page *page, pte_t * page_table, int write_access)
+ struct page *page, pte_t * page_table, int write_access,
+ unsigned long address)
{
unsigned long i;
pte_t entry;
@@ -74,6 +75,7 @@

for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
set_pte(page_table, entry);
+ update_mmu_cache(vma, address, entry);
page_table++;

pte_val(entry) += PAGE_SIZE;
Index: linux-2.6.9/arch/sh64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/sh64/mm/hugetlbpage.c 2004-10-18 14:53:46.000000000 -0700
+++ linux-2.6.9/arch/sh64/mm/hugetlbpage.c 2004-10-21 20:26:15.000000000 -0700
@@ -57,7 +57,8 @@
#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)

static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
- struct page *page, pte_t * page_table, int write_access)
+ struct page *page, pte_t * page_table, int write_access,
+ unsigned long address)
{
unsigned long i;
pte_t entry;
@@ -256,7 +257,7 @@
goto out;
}
}
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+ set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE, addr);
}
out:
spin_unlock(&mm->page_table_lock);
Index: linux-2.6.9/include/linux/hugetlb.h
===================================================================
--- linux-2.6.9.orig/include/linux/hugetlb.h 2004-10-21 14:50:14.000000000 -0700
+++ linux-2.6.9/include/linux/hugetlb.h 2004-10-21 20:22:45.000000000 -0700
@@ -18,7 +18,7 @@
void zap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
pte_t *huge_pte_alloc(struct mm_struct *, unsigned long);
-void set_huge_pte(struct mm_struct *, struct vm_area_struct *, struct page *, pte_t *, int);
+void set_huge_pte(struct mm_struct *, struct vm_area_struct *, struct page *, pte_t *, int, unsigned long);
int handle_hugetlb_mm_fault(struct mm_struct *, struct vm_area_struct *, unsigned long, int);

int hugetlb_report_meminfo(char *);
Index: linux-2.6.9/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/sparc64/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
+++ linux-2.6.9/arch/sparc64/mm/hugetlbpage.c 2004-10-21 20:20:20.000000000 -0700
@@ -54,7 +54,8 @@
#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)

void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
- struct page *page, pte_t * page_table, int write_access)
+ struct page *page, pte_t * page_table, int write_access,
+ unsigned long address)
{
unsigned long i;
pte_t entry;
@@ -71,6 +72,7 @@

for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
set_pte(page_table, entry);
+ update_mmu_cache(vma, address, entry);
page_table++;

pte_val(entry) += PAGE_SIZE;
Index: linux-2.6.9/arch/i386/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/i386/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
+++ linux-2.6.9/arch/i386/mm/hugetlbpage.c 2004-10-21 20:18:36.000000000 -0700
@@ -54,7 +54,8 @@
return (pte_t *) pmd;
}

-void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page, pte_t * page_table, int write_access)
+void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page,
+ pte_t * page_table, int write_access, unsigned long address)
{
pte_t entry;

Index: linux-2.6.9/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
+++ linux-2.6.9/arch/ia64/mm/hugetlbpage.c 2004-10-21 20:25:02.000000000 -0700
@@ -61,7 +61,7 @@

void
set_huge_pte (struct mm_struct *mm, struct vm_area_struct *vma,
- struct page *page, pte_t * page_table, int write_access)
+ struct page *page, pte_t * page_table, int write_access, unsigned long address)
{
pte_t entry;

@@ -74,6 +74,7 @@
entry = pte_mkyoung(entry);
mk_pte_huge(entry);
set_pte(page_table, entry);
+ update_mmu_cache(vma, address, entry);
return;
}
/*

2004-10-22 05:13:13

by Christoph Lameter

[permalink] [raw]
Subject: Hugepages demand paging V1 [4/4]: Numa patch

Changelog
* NUMA enhancements (rough first implementation)
* Do not begin search for huge page memory at the first node
but start at the current node and then search previous and
the following nodes for memory.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/mm/hugetlb.c
===================================================================
--- linux-2.6.9.orig/mm/hugetlb.c 2004-10-21 20:39:50.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c 2004-10-21 20:44:12.000000000 -0700
@@ -28,15 +28,30 @@
free_huge_pages_node[nid]++;
}

-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct vm_area_struct *vma, unsigned long addr)
{
int nid = numa_node_id();
+ int tid, nid2;
struct page *page = NULL;

if (list_empty(&hugepage_freelists[nid])) {
- for (nid = 0; nid < MAX_NUMNODES; ++nid)
- if (!list_empty(&hugepage_freelists[nid]))
- break;
+ /* Prefer the neighboring nodes */
+ for (tid =1 ; tid < MAX_NUMNODES; tid++) {
+
+ /* Is there space in a following node ? */
+ nid2 = (nid + tid) % MAX_NUMNODES;
+ if (mpol_node_valid(nid2, vma, addr) &&
+ !list_empty(&hugepage_freelists[nid2]))
+ break;
+
+ /* or in an previous node ? */
+ if (tid > nid) continue;
+ nid2 = nid - tid;
+ if (mpol_node_valid(nid2, vma, addr) &&
+ !list_empty(&hugepage_freelists[nid2]))
+ break;
+ }
+ nid = nid2;
}
if (nid >= 0 && nid < MAX_NUMNODES &&
!list_empty(&hugepage_freelists[nid])) {
@@ -75,13 +90,13 @@
spin_unlock(&hugetlb_lock);
}

-struct page *alloc_huge_page(void)
+struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr)
{
struct page *page;
int i;

spin_lock(&hugetlb_lock);
- page = dequeue_huge_page();
+ page = dequeue_huge_page(vma, addr);
if (!page) {
spin_unlock(&hugetlb_lock);
return NULL;
@@ -181,7 +196,7 @@
spin_lock(&hugetlb_lock);
try_to_free_low(count);
while (count < nr_huge_pages) {
- struct page *page = dequeue_huge_page();
+ struct page *page = dequeue_huge_page(NULL, 0);
if (!page)
break;
update_and_free_page(page);
@@ -255,7 +270,7 @@
retry:
page = find_get_page(mapping, idx);
if (!page) {
- page = alloc_huge_page();
+ page = alloc_huge_page(vma, addr);
if (!page)
/*
* with strict overcommit accounting, we should never
Index: linux-2.6.9/include/linux/hugetlb.h
===================================================================
--- linux-2.6.9.orig/include/linux/hugetlb.h 2004-10-21 20:44:10.000000000 -0700
+++ linux-2.6.9/include/linux/hugetlb.h 2004-10-21 20:44:56.000000000 -0700
@@ -31,7 +31,7 @@
pmd_t *pmd, int write);
int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
int pmd_huge(pmd_t pmd);
-struct page *alloc_huge_page(void);
+struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr);
void free_huge_page(struct page *);

extern unsigned long max_huge_pages;

2004-10-22 05:11:31

by Christoph Lameter

[permalink] [raw]
Subject: Hugepages demand paging V1 [3/4]: Overcommit handling

Changelog
* overcommit handling

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.9.orig/fs/hugetlbfs/inode.c 2004-10-21 14:50:14.000000000 -0700
+++ linux-2.6.9/fs/hugetlbfs/inode.c 2004-10-21 20:02:23.000000000 -0700
@@ -32,6 +32,206 @@
/* some random number */
#define HUGETLBFS_MAGIC 0x958458f6

+/* Convert loff_t and PAGE_SIZE counts to hugetlb page counts. */
+#define VMACCT(x) ((x) >> (HPAGE_SHIFT))
+#define VMACCTPG(x) ((x) >> (HPAGE_SHIFT - PAGE_SHIFT))
+
+static long hugetlbzone_resv;
+static spinlock_t hugetlbfs_lock = SPIN_LOCK_UNLOCKED;
+
+int hugetlb_acct_memory(long delta)
+{
+ int ret = 0;
+
+ spin_lock(&hugetlbfs_lock);
+ if (delta > 0 && (hugetlbzone_resv + delta) >
+ VMACCTPG(hugetlb_total_pages()))
+ ret = -ENOMEM;
+ else
+ hugetlbzone_resv += delta;
+ spin_unlock(&hugetlbfs_lock);
+ return ret;
+}
+
+struct file_region {
+ struct list_head link;
+ long from;
+ long to;
+};
+
+static int region_add(struct list_head *head, int f, int t)
+{
+ struct file_region *rg;
+ struct file_region *nrg;
+ struct file_region *trg;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* Add a new region if the existing region starts above our end.
+ * We should already have a space to record. */
+ if (&rg->link == head || t < rg->from)
+ BUG();
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (f > rg->from)
+ f = rg->from;
+
+ /* Check for and consume any regions we now overlap with. */
+ nrg = rg;
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > t)
+ break;
+
+ /* If this area reaches higher then extend our area to
+ * include it completely. If this is not the first area
+ * which we intend to reuse, free it. */
+ if (rg->to > t)
+ t = rg->to;
+ if (rg != nrg) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ }
+ nrg->from = f;
+ nrg->to = t;
+ return 0;
+}
+
+static int region_chg(struct list_head *head, int f, int t)
+{
+ struct file_region *rg;
+ struct file_region *nrg;
+ loff_t chg = 0;
+
+ /* Locate the region we are before or in. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* If we are below the current region then a new region is required.
+ * Subtle, allocate a new region at the position but make it zero
+ * size such that we can guarantee to record the reservation. */
+ if (&rg->link == head || t < rg->from) {
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ if (nrg == 0)
+ return -ENOMEM;
+ nrg->from = f;
+ nrg->to = f;
+ INIT_LIST_HEAD(&nrg->link);
+ list_add(&nrg->link, rg->link.prev);
+
+ return t - f;
+ }
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (f > rg->from)
+ f = rg->from;
+ chg = t - f;
+
+ /* Check for and consume any regions we now overlap with. */
+ list_for_each_entry(rg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > t)
+ return chg;
+
+ /* We overlap with this area, if it extends further than
+ * us then we must extend ourselves. Account for its
+ * existing reservation. */
+ if (rg->to > t) {
+ chg += rg->to - t;
+ t = rg->to;
+ }
+ chg -= rg->to - rg->from;
+ }
+ return chg;
+}
+
+static int region_truncate(struct list_head *head, int end)
+{
+ struct file_region *rg;
+ struct file_region *trg;
+ int chg = 0;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (end <= rg->to)
+ break;
+ if (&rg->link == head)
+ return 0;
+
+ /* If we are in the middle of a region then adjust it. */
+ if (end > rg->from) {
+ chg = rg->to - end;
+ rg->to = end;
+ rg = list_entry(rg->link.next, typeof(*rg), link);
+ }
+
+ /* Drop any remaining regions. */
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ chg += rg->to - rg->from;
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ return chg;
+}
+
+#if 0
+static int region_dump(struct list_head *head)
+{
+ struct file_region *rg;
+
+ list_for_each_entry(rg, head, link)
+ printk(KERN_WARNING "rg<%p> f<%lld> t<%lld>\n",
+ rg, rg->from, rg->to);
+ return 0;
+}
+#endif
+
+/* Calculate the commitment change that this mapping implies
+ * and check it against both the commitment and quota limits. */
+static int hugetlb_acct_commit(struct inode *inode, int from, int to)
+{
+ int chg;
+ int ret;
+
+ chg = region_chg(&inode->i_mapping->private_list, from, to);
+ if (chg < 0)
+ return chg;
+ ret = hugetlb_acct_memory(chg);
+ if (ret < 0)
+ return ret;
+ ret = hugetlb_get_quota(inode->i_mapping, chg);
+ if (ret < 0)
+ goto undo_commit;
+ ret = region_add(&inode->i_mapping->private_list, from, to);
+ return ret;
+
+undo_commit:
+ hugetlb_acct_memory(-chg);
+ return ret;
+}
+static void hugetlb_acct_release(struct inode *inode, int to)
+{
+ int chg;
+
+ chg = region_truncate(&inode->i_mapping->private_list, to);
+ hugetlb_acct_memory(-chg);
+ hugetlb_put_quota(inode->i_mapping, chg);
+}
+
+int hugetlbfs_report_meminfo(char *buf)
+{
+ return sprintf(buf, "HugePages_Reserved: %5lu\n", hugetlbzone_resv);
+}
+
static struct super_operations hugetlbfs_ops;
static struct address_space_operations hugetlbfs_aops;
struct file_operations hugetlbfs_file_operations;
@@ -48,7 +248,6 @@
static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_dentry->d_inode;
- struct address_space *mapping = inode->i_mapping;
loff_t len, vma_len;
int ret;

@@ -79,7 +278,10 @@
if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
goto out;

- if (inode->i_size < len)
+ ret = hugetlb_acct_commit(inode, VMACCTPG(vma->vm_pgoff),
+ VMACCTPG(vma->vm_pgoff + (vma_len >> PAGE_SHIFT)));
+
+ if (ret >= 0 && inode->i_size < len)
inode->i_size = len;
out:
up(&inode->i_sem);
@@ -194,7 +396,6 @@
++next;
truncate_huge_page(page);
unlock_page(page);
- hugetlb_put_quota(mapping);
}
huge_pagevec_release(&pvec);
}
@@ -213,6 +414,7 @@

if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);
+ hugetlb_acct_release(inode, 0);

security_inode_delete(inode);

@@ -254,6 +456,7 @@
spin_unlock(&inode_lock);
if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);
+ hugetlb_acct_release(inode, 0);

if (sbinfo->free_inodes >= 0) {
spin_lock(&sbinfo->stat_lock);
@@ -324,6 +527,7 @@
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
spin_unlock(&mapping->i_mmap_lock);
truncate_hugepages(mapping, offset);
+ hugetlb_acct_release(inode, VMACCT(offset));
return 0;
}

@@ -378,6 +582,7 @@
inode->i_blocks = 0;
inode->i_mapping->a_ops = &hugetlbfs_aops;
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
+ INIT_LIST_HEAD(&inode->i_mapping->private_list);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
info = HUGETLBFS_I(inode);
mpol_shared_policy_init(&info->policy);
@@ -669,15 +874,15 @@
return -ENOMEM;
}

-int hugetlb_get_quota(struct address_space *mapping)
+int hugetlb_get_quota(struct address_space *mapping, int blocks)
{
int ret = 0;
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

if (sbinfo->free_blocks > -1) {
spin_lock(&sbinfo->stat_lock);
- if (sbinfo->free_blocks > 0)
- sbinfo->free_blocks--;
+ if (sbinfo->free_blocks >= blocks)
+ sbinfo->free_blocks -= blocks;
else
ret = -ENOMEM;
spin_unlock(&sbinfo->stat_lock);
@@ -686,13 +891,13 @@
return ret;
}

-void hugetlb_put_quota(struct address_space *mapping)
+void hugetlb_put_quota(struct address_space *mapping, int blocks)
{
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

if (sbinfo->free_blocks > -1) {
spin_lock(&sbinfo->stat_lock);
- sbinfo->free_blocks++;
+ sbinfo->free_blocks += blocks;
spin_unlock(&sbinfo->stat_lock);
}
}
@@ -745,9 +950,6 @@
if (!can_do_hugetlb_shm())
return ERR_PTR(-EPERM);

- if (!is_hugepage_mem_enough(size))
- return ERR_PTR(-ENOMEM);
-
if (!user_shm_lock(size, current->user))
return ERR_PTR(-ENOMEM);

@@ -779,6 +981,14 @@
file->f_mapping = inode->i_mapping;
file->f_op = &hugetlbfs_file_operations;
file->f_mode = FMODE_WRITE | FMODE_READ;
+
+ /* Account for the memory usage for this segment at create time.
+ * This maintains the commit on shmget() semantics of normal
+ * shared memory segments. */
+ error = hugetlb_acct_commit(inode, 0, VMACCT(size));
+ if (error < 0)
+ goto out_file;
+
return file;

out_file:
Index: linux-2.6.9/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.9.orig/fs/proc/proc_misc.c 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/fs/proc/proc_misc.c 2004-10-21 20:01:09.000000000 -0700
@@ -235,6 +235,7 @@
vmi.largest_chunk
);

+ len += hugetlbfs_report_meminfo(page + len);
len += hugetlb_report_meminfo(page + len);

return proc_calc_metrics(page, start, off, count, eof, len);
Index: linux-2.6.9/include/linux/hugetlb.h
===================================================================
--- linux-2.6.9.orig/include/linux/hugetlb.h 2004-10-21 14:50:14.000000000 -0700
+++ linux-2.6.9/include/linux/hugetlb.h 2004-10-21 20:01:09.000000000 -0700
@@ -122,8 +122,8 @@
extern struct file_operations hugetlbfs_file_operations;
extern struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_zero_setup(size_t);
-int hugetlb_get_quota(struct address_space *mapping);
-void hugetlb_put_quota(struct address_space *mapping);
+int hugetlb_get_quota(struct address_space *mapping, int blocks);
+void hugetlb_put_quota(struct address_space *mapping, int blocks);

static inline int is_file_hugepages(struct file *file)
{
@@ -134,11 +134,14 @@
{
file->f_op = &hugetlbfs_file_operations;
}
+int hugetlbfs_report_meminfo(char *);
+
#else /* !CONFIG_HUGETLBFS */

#define is_file_hugepages(file) 0
#define set_file_hugepages(file) BUG()
#define hugetlb_zero_setup(size) ERR_PTR(-ENOSYS)
+#define hugetlbfs_report_meminfo(buf) 0

#endif /* !CONFIG_HUGETLBFS */


2004-10-22 05:11:31

by Christoph Lameter

[permalink] [raw]
Subject: Hugepages demand paging V1 [1/4]: demand paging core

ChangeLog
* provide huge page fault handler

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.9/arch/i386/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/i386/mm/hugetlbpage.c 2004-10-21 12:01:21.000000000 -0700
+++ linux-2.6.9/arch/i386/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
@@ -18,13 +18,26 @@
#include <asm/tlb.h>
#include <asm/tlbflush.h>

-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+static void scrub_one_pmd(pmd_t * pmd)
+{
+ struct page *page;
+
+ if (pmd && !pmd_none(*pmd) && !pmd_huge(*pmd)) {
+ page = pmd_page(*pmd);
+ pmd_clear(pmd);
+ dec_page_state(nr_page_table_pages);
+ page_cache_release(page);
+ }
+}
+
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pmd_t *pmd = NULL;

pgd = pgd_offset(mm, addr);
pmd = pmd_alloc(mm, pgd, addr);
+ scrub_one_pmd(pmd);
return (pte_t *) pmd;
}

@@ -34,11 +47,14 @@
pmd_t *pmd = NULL;

pgd = pgd_offset(mm, addr);
- pmd = pmd_offset(pgd, addr);
+ if (pgd_present(*pgd)) {
+ pmd = pmd_offset(pgd, addr);
+ scrub_one_pmd(pmd);
+ }
return (pte_t *) pmd;
}

-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page, pte_t * page_table, int write_access)
+void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page, pte_t * page_table, int write_access)
{
pte_t entry;

@@ -73,17 +89,18 @@
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;

- while (addr < end) {
+ for (; addr < end; addr+= HPAGE_SIZE) {
+ src_pte = huge_pte_offset(src, addr);
+ if (!src_pte || pte_none(*src_pte))
+ continue;
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
- src_pte = huge_pte_offset(src, addr);
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
set_pte(dst_pte, entry);
dst->rss += (HPAGE_SIZE / PAGE_SIZE);
- addr += HPAGE_SIZE;
}
return 0;

@@ -217,68 +234,8 @@
continue;
page = pte_page(pte);
put_page(page);
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
}

-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- BUG_ON(vma->vm_start & ~HPAGE_MASK);
- BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
-
- if (!pte_none(*pte)) {
- pmd_t *pmd = (pmd_t *) pte;
-
- page = pmd_page(*pmd);
- pmd_clear(pmd);
- mm->nr_ptes--;
- dec_page_state(nr_page_table_pages);
- page_cache_release(page);
- }
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
- }
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
-}
Index: linux-2.6.9/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/mm/hugetlbpage.c 2004-10-18 14:54:27.000000000 -0700
+++ linux-2.6.9/arch/ia64/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
@@ -24,7 +24,7 @@

unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;

-static pte_t *
+pte_t *
huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
{
unsigned long taddr = htlbpage_to_page(addr);
@@ -59,7 +59,7 @@

#define mk_pte_huge(entry) { pte_val(entry) |= _PAGE_P; }

-static void
+void
set_huge_pte (struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page, pte_t * page_table, int write_access)
{
@@ -99,17 +99,18 @@
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;

- while (addr < end) {
+ for (; addr < end; addr += HPAGE_SIZE) {
+ src_pte = huge_pte_offset(src, addr);
+ if (!src_pte || pte_none(*src_pte))
+ continue;
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
- src_pte = huge_pte_offset(src, addr);
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
set_pte(dst_pte, entry);
dst->rss += (HPAGE_SIZE / PAGE_SIZE);
- addr += HPAGE_SIZE;
}
return 0;
nomem:
@@ -243,69 +244,16 @@

for (address = start; address < end; address += HPAGE_SIZE) {
pte = huge_pte_offset(mm, address);
- if (pte_none(*pte))
+ if (!pte || pte_none(*pte))
continue;
page = pte_page(*pte);
put_page(page);
pte_clear(pte);
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
}

-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- BUG_ON(vma->vm_start & ~HPAGE_MASK);
- BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
- if (!pte_none(*pte))
- continue;
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- page_cache_release(page);
- goto out;
- }
- }
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
-}
-
unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
Index: linux-2.6.9/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/ppc64/mm/hugetlbpage.c 2004-10-21 12:01:21.000000000 -0700
+++ linux-2.6.9/arch/ppc64/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
@@ -408,66 +408,9 @@
pte, local);

put_page(page);
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
put_cpu();
-
- mm->rss -= (end - start) >> PAGE_SHIFT;
-}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- WARN_ON(!is_vm_hugetlb_page(vma));
- BUG_ON((vma->vm_start % HPAGE_SIZE) != 0);
- BUG_ON((vma->vm_end % HPAGE_SIZE) != 0);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- hugepte_t *pte = hugepte_alloc(mm, addr);
- struct page *page;
-
- BUG_ON(!in_hugepage_area(mm->context, addr));
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
- if (!hugepte_none(*pte))
- continue;
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
- }
- setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
}

/* Because we have an exclusive hugepage region which lies within the
@@ -863,3 +806,59 @@

ppc_md.hpte_invalidate(slot, va, 1, local);
}
+
+int
+handle_hugetlb_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+ unsigned long addr, int write_access)
+{
+ hugepte_t *pte;
+ struct page *page;
+ struct address_space *mapping;
+ int idx, ret;
+
+ spin_lock(&mm->page_table_lock);
+ pte = hugepte_alloc(mm, addr & HPAGE_MASK);
+ if (!pte)
+ goto oom;
+ if (!hugepte_none(*pte))
+ goto out;
+ spin_unlock(&mm->page_table_lock);
+
+ mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+ idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
+ + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+retry:
+ page = find_get_page(mapping, idx);
+ if (!page) {
+ page = alloc_huge_page();
+ if (!page)
+ /*
+ * with strict overcommit accounting, we should never
+ * run out of hugetlb page, so must be a fault race
+ * and let's retry.
+ */
+ goto retry;
+ ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+ if (!ret) {
+ unlock_page(page);
+ } else {
+ put_page(page);
+ if (ret == -EEXIST)
+ goto retry;
+ else
+ return VM_FAULT_OOM;
+ }
+ }
+
+ spin_lock(&mm->page_table_lock);
+ if (hugepte_none(*pte))
+ setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE);
+ else
+ put_page(page);
+out:
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_MINOR;
+oom:
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_OOM;
+}
Index: linux-2.6.9/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/hugetlbpage.c 2004-10-18 14:54:32.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
@@ -24,7 +24,7 @@
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>

-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pmd_t *pmd;
@@ -56,7 +56,7 @@

#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)

-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page, pte_t * page_table, int write_access)
{
unsigned long i;
@@ -101,12 +101,13 @@
unsigned long end = vma->vm_end;
int i;

- while (addr < end) {
+ for (; addr < end; addr += HPAGE_SIZE) {
+ src_pte = huge_pte_offset(src, addr);
+ if (!src_pte || pte_none(*src_pte))
+ continue;
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
- src_pte = huge_pte_offset(src, addr);
- BUG_ON(!src_pte || pte_none(*src_pte));
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
@@ -116,7 +117,6 @@
dst_pte++;
}
dst->rss += (HPAGE_SIZE / PAGE_SIZE);
- addr += HPAGE_SIZE;
}
return 0;

@@ -196,8 +196,7 @@

for (address = start; address < end; address += HPAGE_SIZE) {
pte = huge_pte_offset(mm, address);
- BUG_ON(!pte);
- if (pte_none(*pte))
+ if (!pte || pte_none(*pte))
continue;
page = pte_page(*pte);
put_page(page);
@@ -205,60 +204,7 @@
pte_clear(pte);
pte++;
}
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- BUG_ON(vma->vm_start & ~HPAGE_MASK);
- BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
- if (!pte_none(*pte))
- continue;
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
- }
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
-}
Index: linux-2.6.9/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/sparc64/mm/hugetlbpage.c 2004-10-18 14:54:38.000000000 -0700
+++ linux-2.6.9/arch/sparc64/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
@@ -21,7 +21,7 @@
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>

-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pmd_t *pmd;
@@ -53,7 +53,7 @@

#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)

-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page, pte_t * page_table, int write_access)
{
unsigned long i;
@@ -98,12 +98,13 @@
unsigned long end = vma->vm_end;
int i;

- while (addr < end) {
+ for (; addr < end; addr += HPAGE_SIZE) {
+ src_pte = huge_pte_offset(src, addr);
+ if (!src_pte || pte_none(*src_pte))
+ continue;
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
- src_pte = huge_pte_offset(src, addr);
- BUG_ON(!src_pte || pte_none(*src_pte));
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
@@ -113,7 +114,6 @@
dst_pte++;
}
dst->rss += (HPAGE_SIZE / PAGE_SIZE);
- addr += HPAGE_SIZE;
}
return 0;

@@ -193,8 +193,7 @@

for (address = start; address < end; address += HPAGE_SIZE) {
pte = huge_pte_offset(mm, address);
- BUG_ON(!pte);
- if (pte_none(*pte))
+ if (!pte || pte_none(*pte))
continue;
page = pte_page(*pte);
put_page(page);
@@ -202,60 +201,7 @@
pte_clear(pte);
pte++;
}
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- BUG_ON(vma->vm_start & ~HPAGE_MASK);
- BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
- if (!pte_none(*pte))
- continue;
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
- }
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
-}
Index: linux-2.6.9/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.9.orig/fs/hugetlbfs/inode.c 2004-10-18 14:55:07.000000000 -0700
+++ linux-2.6.9/fs/hugetlbfs/inode.c 2004-10-21 14:50:14.000000000 -0700
@@ -79,10 +79,6 @@
if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
goto out;

- ret = hugetlb_prefault(mapping, vma);
- if (ret)
- goto out;
-
if (inode->i_size < len)
inode->i_size = len;
out:
Index: linux-2.6.9/include/linux/hugetlb.h
===================================================================
--- linux-2.6.9.orig/include/linux/hugetlb.h 2004-10-18 14:54:08.000000000 -0700
+++ linux-2.6.9/include/linux/hugetlb.h 2004-10-21 14:50:14.000000000 -0700
@@ -17,7 +17,10 @@
int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
void zap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
-int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
+pte_t *huge_pte_alloc(struct mm_struct *, unsigned long);
+void set_huge_pte(struct mm_struct *, struct vm_area_struct *, struct page *, pte_t *, int);
+int handle_hugetlb_mm_fault(struct mm_struct *, struct vm_area_struct *, unsigned long, int);
+
int hugetlb_report_meminfo(char *);
int hugetlb_report_node_meminfo(int, char *);
int is_hugepage_mem_enough(size_t);
@@ -61,7 +64,7 @@
#define follow_hugetlb_page(m,v,p,vs,a,b,i) ({ BUG(); 0; })
#define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL)
#define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
-#define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
+#define handle_hugetlb_mm_fault(mm, vma, addr, write) VM_FAULT_SIGBUS
#define zap_hugepage_range(vma, start, len) BUG()
#define unmap_hugepage_range(vma, start, end) BUG()
#define is_hugepage_mem_enough(size) 0
Index: linux-2.6.9/mm/hugetlb.c
===================================================================
--- linux-2.6.9.orig/mm/hugetlb.c 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c 2004-10-21 20:39:50.000000000 -0700
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/hugetlb.h>
+#include <linux/pagemap.h>
#include <linux/sysctl.h>
#include <linux/highmem.h>

@@ -231,11 +232,65 @@
}
EXPORT_SYMBOL(hugetlb_total_pages);

+int __attribute__ ((weak))
+handle_hugetlb_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+ unsigned long addr, int write_access)
+{
+ pte_t *pte;
+ struct page *page;
+ struct address_space *mapping;
+ int idx, ret;
+
+ spin_lock(&mm->page_table_lock);
+ pte = huge_pte_alloc(mm, addr & HPAGE_MASK);
+ if (!pte)
+ goto oom;
+ if (!pte_none(*pte))
+ goto out;
+ spin_unlock(&mm->page_table_lock);
+
+ mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+ idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
+ + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+retry:
+ page = find_get_page(mapping, idx);
+ if (!page) {
+ page = alloc_huge_page();
+ if (!page)
+ /*
+ * with strict overcommit accounting, we should never
+ * run out of hugetlb page, so must be a fault race
+ * and let's retry.
+ */
+ goto retry;
+ ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+ if (!ret) {
+ unlock_page(page);
+ } else {
+ put_page(page);
+ if (ret == -EEXIST)
+ goto retry;
+ else
+ return VM_FAULT_OOM;
+ }
+ }
+
+ spin_lock(&mm->page_table_lock);
+ if (pte_none(*pte))
+ set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE, addr);
+ else
+ put_page(page);
+out:
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_MINOR;
+oom:
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_OOM;
+}
+
/*
- * We cannot handle pagefaults against hugetlb pages at all. They cause
- * handle_mm_fault() to try to instantiate regular-sized pages in the
- * hugegpage VMA. do_page_fault() is supposed to trap this, so BUG is we get
- * this far.
+ * We should not get here because handle_mm_fault() is supposed to trap
+ * hugetlb page fault. BUG if we get here.
*/
static struct page *hugetlb_nopage(struct vm_area_struct *vma,
unsigned long address, int *unused)
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/mm/memory.c 2004-10-21 14:50:14.000000000 -0700
@@ -765,11 +765,6 @@
|| !(flags & vma->vm_flags))
return i ? : -EFAULT;

- if (is_vm_hugetlb_page(vma)) {
- i = follow_hugetlb_page(mm, vma, pages, vmas,
- &start, &len, i);
- continue;
- }
spin_lock(&mm->page_table_lock);
do {
struct page *map;
@@ -1693,7 +1688,7 @@
inc_page_state(pgfault);

if (is_vm_hugetlb_page(vma))
- return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ return handle_hugetlb_mm_fault(mm, vma, address, write_access);

/*
* We need the page table lock to synchronize with kswapd

2004-10-22 06:11:39

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Hugepages demand paging V1 [4/4]: Numa patch

Christoph Lameter wrote on Thursday, October 21, 2004 9:59 PM
> Changelog
> * NUMA enhancements (rough first implementation)
> * Do not begin search for huge page memory at the first node
> but start at the current node and then search previous and
> the following nodes for memory.
>
> -static struct page *dequeue_huge_page(void)
> +static struct page *dequeue_huge_page(struct vm_area_struct *vma, unsigned long addr)
> {
> int nid = numa_node_id();
> + int tid, nid2;
> struct page *page = NULL;
>
> if (list_empty(&hugepage_freelists[nid])) {
> - for (nid = 0; nid < MAX_NUMNODES; ++nid)
> - if (!list_empty(&hugepage_freelists[nid]))
> - break;
> + /* Prefer the neighboring nodes */
> + for (tid =1 ; tid < MAX_NUMNODES; tid++) {
> +
> + /* Is there space in a following node ? */
> + nid2 = (nid + tid) % MAX_NUMNODES;
> + if (mpol_node_valid(nid2, vma, addr) &&
> + !list_empty(&hugepage_freelists[nid2]))
> + break;
> +
> + /* or in an previous node ? */
> + if (tid > nid) continue;
> + nid2 = nid - tid;
> + if (mpol_node_valid(nid2, vma, addr) &&
> + !list_empty(&hugepage_freelists[nid2]))
> + break;

Are you sure about this? Looked flawed to me. Logical node number
does not directly correlate to numa memory hierarchy.

- Ken


2004-10-22 10:30:22

by Andrew Morton

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [3/4]: Overcommit handling

Christoph Lameter <[email protected]> wrote:
>
> * overcommit handling

What does this do, and why do we want it?

2004-10-22 10:37:14

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [2/4]: set_huge_pte() arch updates

On Thu, Oct 21, 2004 at 09:57:23PM -0700, Christoph Lameter wrote:
> Changelog
> * Update set_huge_pte throughout all arches
> * set_huge_pte has an additional address argument
> * set_huge_pte must also do what update_mmu_cache typically does
> for PAGESIZE ptes.
> Signed-off-by: Christoph Lameter <[email protected]>

What's described above is not what the patch implements. The patch is
calling update_mmu_cache() in a loop on all the virtual base pages of a
virtual hugepage, which won't help at all, as it doesn't understand how
to find the hugepages regardless of virtual address. AFAICT code to
actually do the equivalent of update_mmu_cache() on hugepages most
likely involves privileged instructions and perhaps digging around some
cpu-specific data structures (e.g. the natively architected pagetables
bearing no resemblance to Linux') for almost every non-x86 architecture.


-- wli

2004-10-22 10:48:05

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [1/4]: demand paging core

On Thu, Oct 21, 2004 at 09:56:27PM -0700, Christoph Lameter wrote:
> +static void scrub_one_pmd(pmd_t * pmd)
> +{
> + struct page *page;
> +
> + if (pmd && !pmd_none(*pmd) && !pmd_huge(*pmd)) {
> + page = pmd_page(*pmd);
> + pmd_clear(pmd);
> + dec_page_state(nr_page_table_pages);
> + page_cache_release(page);
> + }
> +}

It would be nicer to fix the pagetable leak (over the lifetime of a
process) in the core instead of sprinkling hugetlb with this.


-- wli

2004-10-22 10:50:04

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [3/4]: Overcommit handling

On Thu, Oct 21, 2004 at 09:58:26PM -0700, Christoph Lameter wrote:
> Changelog
> * overcommit handling
> Signed-off-by: Christoph Lameter <[email protected]>

I can make this out, but this probably actually needs to be presented
in slow motion so more than the two of us and its original authors can
work on it (and so I don't forget everything about it in 12-18 months).


-- wli

2004-10-22 11:00:50

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [4/4]: Numa patch

On Thu, Oct 21, 2004 at 09:58:54PM -0700, Christoph Lameter wrote:
> Changelog
> * NUMA enhancements (rough first implementation)
> * Do not begin search for huge page memory at the first node
> but start at the current node and then search previous and
> the following nodes for memory.
> Signed-off-by: Christoph Lameter <[email protected]>

dequeue_huge_page() seems to want a nodemask, not a vma, though I
suppose it's not particularly pressing.


> Index: linux-2.6.9/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.9.orig/mm/hugetlb.c 2004-10-21 20:39:50.000000000 -0700
> +++ linux-2.6.9/mm/hugetlb.c 2004-10-21 20:44:12.000000000 -0700
> @@ -28,15 +28,30 @@
> free_huge_pages_node[nid]++;
> }
>
> -static struct page *dequeue_huge_page(void)
> +static struct page *dequeue_huge_page(struct vm_area_struct *vma, unsigned long addr)
> {
> int nid = numa_node_id();
> + int tid, nid2;
> struct page *page = NULL;
>
> if (list_empty(&hugepage_freelists[nid])) {
> - for (nid = 0; nid < MAX_NUMNODES; ++nid)
> - if (!list_empty(&hugepage_freelists[nid]))
> - break;
> + /* Prefer the neighboring nodes */
> + for (tid =1 ; tid < MAX_NUMNODES; tid++) {
> +
> + /* Is there space in a following node ? */
> + nid2 = (nid + tid) % MAX_NUMNODES;
> + if (mpol_node_valid(nid2, vma, addr) &&
> + !list_empty(&hugepage_freelists[nid2]))
> + break;
> +
> + /* or in an previous node ? */
> + if (tid > nid) continue;
> + nid2 = nid - tid;
> + if (mpol_node_valid(nid2, vma, addr) &&
> + !list_empty(&hugepage_freelists[nid2]))
> + break;
> + }
> + nid = nid2;
> }
> if (nid >= 0 && nid < MAX_NUMNODES &&
> !list_empty(&hugepage_freelists[nid])) {
> @@ -75,13 +90,13 @@
> spin_unlock(&hugetlb_lock);
> }
>
> -struct page *alloc_huge_page(void)
> +struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr)
> {
> struct page *page;
> int i;
>
> spin_lock(&hugetlb_lock);
> - page = dequeue_huge_page();
> + page = dequeue_huge_page(vma, addr);
> if (!page) {
> spin_unlock(&hugetlb_lock);
> return NULL;
> @@ -181,7 +196,7 @@
> spin_lock(&hugetlb_lock);
> try_to_free_low(count);
> while (count < nr_huge_pages) {
> - struct page *page = dequeue_huge_page();
> + struct page *page = dequeue_huge_page(NULL, 0);
> if (!page)
> break;
> update_and_free_page(page);
> @@ -255,7 +270,7 @@
> retry:
> page = find_get_page(mapping, idx);
> if (!page) {
> - page = alloc_huge_page();
> + page = alloc_huge_page(vma, addr);
> if (!page)
> /*
> * with strict overcommit accounting, we should never
> Index: linux-2.6.9/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.9.orig/include/linux/hugetlb.h 2004-10-21 20:44:10.000000000 -0700
> +++ linux-2.6.9/include/linux/hugetlb.h 2004-10-21 20:44:56.000000000 -0700
> @@ -31,7 +31,7 @@
> pmd_t *pmd, int write);
> int is_aligned_hugepage_range(unsigned long addr, unsigned long len);
> int pmd_huge(pmd_t pmd);
> -struct page *alloc_huge_page(void);
> +struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr);
> void free_huge_page(struct page *);
>
> extern unsigned long max_huge_pages;
>

2004-10-22 11:01:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [3/4]: Overcommit handling

On Thu, Oct 21, 2004 at 09:58:26PM -0700, Christoph Lameter wrote:
> Changelog
> * overcommit handling

overcommit for huge pages sounds like a really bad idea. Care to explain
why you want it?

2004-10-22 11:13:19

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [3/4]: Overcommit handling

On Thu, Oct 21, 2004 at 09:58:26PM -0700, Christoph Lameter wrote:
>> Changelog
>> * overcommit handling

On Fri, Oct 22, 2004 at 12:01:16PM +0100, Christoph Hellwig wrote:
> overcommit for huge pages sounds like a really bad idea. Care to explain
> why you want it?

It's the opposite of what its name implies; it implements strict
non-overcommit, in the sense that it tries to prevent the sum of
possible hugetlb allocations arising from handling hugetlb faults from
exceeding the size of the hugetlb memory pool.
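
Concretely, this is a minimal restatement of the check added in patch 3/4
(hugetlb_acct_memory), which refuses any reservation increase the fixed pool
cannot cover:

	/* reservation accounting sketch: fail mmap()/shmget() up front
	 * rather than SIGBUS at fault time */
	if (delta > 0 && (hugetlbzone_resv + delta) >
			VMACCTPG(hugetlb_total_pages()))
		ret = -ENOMEM;
	else
		hugetlbzone_resv += delta;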


-- wli

2004-10-22 11:16:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [3/4]: Overcommit handling

On Fri, Oct 22, 2004 at 04:12:59AM -0700, William Lee Irwin III wrote:
> On Thu, Oct 21, 2004 at 09:58:26PM -0700, Christoph Lameter wrote:
> >> Changelog
> >> * overcommit handling
>
> On Fri, Oct 22, 2004 at 12:01:16PM +0100, Christoph Hellwig wrote:
> > overcommit for huge pages sounds like a really bad idea. Care to explain
> > why you want it?
>
> It's the opposite of what its name implies; it implements strict
> non-overcommit, in the sense that it tries to prevent the sum of
> possible hugetlb allocations arising from handling hugetlb faults from
> exceeding the size of the hugetlb memory pool.

I thought that was the state of the art for hugetlb pages already?

2004-10-22 11:21:19

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [3/4]: Overcommit handling

On Fri, Oct 22, 2004 at 12:01:16PM +0100, Christoph Hellwig wrote:
>>> overcommit for huge pages sounds like a really bad idea. Care to explain
>>> why you want it?

On Fri, Oct 22, 2004 at 04:12:59AM -0700, William Lee Irwin III wrote:
> > It's the opposite of what its name implies; it implements strict
> > non-overcommit, in the sense that it tries to prevent the sum of
> > possible hugetlb allocations arising from handling hugetlb faults from
> > exceeding the size of the hugetlb memory pool.

On Fri, Oct 22, 2004 at 12:16:26PM +0100, Christoph Hellwig wrote:
> I thought that was the state of the art for hugetlb pages already?

Only vacuously so, for mainline is not handling hugetlb faults.

The real impediment to all this is that no one is bothering to dredge
up architecture manuals for the architectures they're touching to
create plausible equivalents of update_mmu_cache(), clear_dcache_page()
(not considered by Lameter's patches at all), et al for hugetlb.


-- wli

2004-10-22 11:23:56

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [3/4]: Overcommit handling

On Fri, Oct 22, 2004 at 12:16:26PM +0100, Christoph Hellwig wrote:
>> I thought that was the state of the art for hugetlb pages already?

On Fri, Oct 22, 2004 at 04:21:01AM -0700, William Lee Irwin III wrote:
> Only vacuously so, for mainline is not handling hugetlb faults.
> The real impediment to all this is that no one is bothering to dredge
> up architecture manuals for the architectures they're touching to
> create plausible equivalents of update_mmu_cache(), clear_dcache_page()
> (not considered by Lameter's patches at all), et al for hugetlb.

flush_dcache_page(), sorry.


-- wli

2004-10-22 15:10:21

by Christoph Lameter

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [3/4]: Overcommit handling

On Fri, 22 Oct 2004, Andrew Morton wrote:

> Christoph Lameter <[email protected]> wrote:
> >
> > * overcommit handling
>
> What does this do, and why do we want it?

It was posted with that explanation by Ken. Will look at it and clarify
its purpose for the second round.

2004-10-22 15:34:29

by Christoph Lameter

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [1/4]: demand paging core

On Fri, 22 Oct 2004, William Lee Irwin III wrote:

> On Thu, Oct 21, 2004 at 09:56:27PM -0700, Christoph Lameter wrote:
> > +static void scrub_one_pmd(pmd_t * pmd)
> > +{
> > + struct page *page;
> > +
> > + if (pmd && !pmd_none(*pmd) && !pmd_huge(*pmd)) {
> > + page = pmd_page(*pmd);
> > + pmd_clear(pmd);
> > + dec_page_state(nr_page_table_pages);
> > + page_cache_release(page);
> > + }
> > +}
>
> It would be nicer to fix the pagetable leak (over the lifetime of a
> process) in the core instead of sprinkling hugetlb with this.

Yes, I contacted you yesterday about this. Could you take this on? CC me
on the patches and we will try to support you as best we can.

2004-10-22 15:35:23

by Christoph Lameter

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [2/4]: set_huge_pte() arch updates

On Fri, 22 Oct 2004, William Lee Irwin III wrote:

> On Thu, Oct 21, 2004 at 09:57:23PM -0700, Christoph Lameter wrote:
> > Changelog
> > * Update set_huge_pte throughout all arches
> > * set_huge_pte has an additional address argument
> > * set_huge_pte must also do what update_mmu_cache typically does
> > for PAGESIZE ptes.
> > Signed-off-by: Christoph Lameter <[email protected]>
>
> What's described above is not what the patch implements. The patch is
> calling update_mmu_cache() in a loop on all the virtual base pages of a
> virtual hugepage, which won't help at all, as it doesn't understand how
> to find the hugepages regardless of virtual address. AFAICT code to
> actually do the equivalent of update_mmu_cache() on hugepages most
> likely involves privileged instructions and perhaps digging around some
> cpu-specific data structures (e.g. the natively architected pagetables
> bearing no resemblance to Linux') for almost every non-x86 architecture.

The looping is architecture specific. But you are right: for the
architectures where I simply did the loop, that is wrong. The address given
needs to be calculated correctly, which it is not.

Again, this is arch-specific stuff and can be done as needed for any
architecture. There is no intent to generalize this.

2004-10-22 15:35:23

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Hugepages demand paging V1 [3/4]: Overcommit handling

Andrew Morton wrote on Friday, October 22, 2004 3:28 AM
> Christoph Lameter <[email protected]> wrote:
> >
> > * overcommit handling
>
> What does this do, and why do we want it?

The name "overcommit" is definitely misleading. It means the
opposite. Since the physical hugetlb backing page pool is fixed and
controlled by the sysadmin, the fault handler cannot allocate free
pages when the hugetlb page pool is exhausted because users have
overcommitted hugetlb pages. Thus we enforce strict accounting up front
to guarantee that a hugetlb page will be available at fault time.

The downside of not having it is that we SIGBUS if the kernel doesn't
have any free hugetlb pages left.

- Ken


2004-10-22 15:42:48

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [2/4]: set_huge_pte() arch updates

On Fri, 22 Oct 2004, William Lee Irwin III wrote:
>> What's described above is not what the patch implements. The patch is
>> calling update_mmu_cache() in a loop on all the virtual base pages of a
>> virtual hugepage, which won't help at all, as it doesn't understand how
>> to find the hugepages regardless of virtual address. AFAICT code to
>> actually do the equivalent of update_mmu_cache() on hugepages most
>> likely involves privileged instructions and perhaps digging around some
>> cpu-specific data structures (e.g. the natively architected pagetables
>> bearing no resemblance to Linux') for almost every non-x86 architecture.

On Fri, Oct 22, 2004 at 08:32:34AM -0700, Christoph Lameter wrote:
> The looping is architecture specific. But you are right for the
> architectures where I simply did the loop that is wrong. The address given
> needs to be correctly calculated which it is not.
> Again this is arch specific stuff and can be done as needed for any
> architecture. There is no intend to generalize this.

The "model", as it were, for pagetable updats, is a 3-stage model:
(1) prepare
(2) update
(3) commit

update_mmu_cache() is stage (3). Linux' software pagetable
modifications are step (2). Generally, architectures are trying to fold
stages (2) and (3) together in set_huge_pte(), so update_mmu_cache() is
not so much of an issue for them (and it's actually incorrect to do
this multiple times or without the architectures' awareness of hugepages
in update_mmu_cache() and so on). To completely regularize the hugetlb
case, merely move the commitment to cpu-native structures or other TLB
insertion done within set_huge_pte() to a new case in update_mmu_cache()
for large pages.
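
Schematically (a sketch of the model, not any particular architecture's
actual code; argument names abbreviated), the hugetlb fault path would then
read:

	pte = huge_pte_alloc(mm, addr);			/* (1) prepare */
	set_huge_pte(mm, vma, page, pte, write, addr);	/* (2) update Linux's
							   software pagetables */
	update_mmu_cache(vma, addr, *pte);		/* (3) commit: native
							   pagetable/TLB insertion,
							   via a hugepage-aware case */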

What is in fact far more pressing is flush_dcache_page(), without a
correct implementation of which for hugetlb, user-visible data
corruption follows.


-- wli

2004-10-22 19:40:04

by Christoph Lameter

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [4/4]: Numa patch

On Fri, 22 Oct 2004, William Lee Irwin III wrote:

> On Thu, Oct 21, 2004 at 09:58:54PM -0700, Christoph Lameter wrote:
> > Changelog
> > * NUMA enhancements (rough first implementation)
> > * Do not begin search for huge page memory at the first node
> > but start at the current node and then search previous and
> > the following nodes for memory.
> > Signed-off-by: Christoph Lameter <[email protected]>
>
> dequeue_huge_page() seems to want a nodemask, not a vma, though I
> suppose it's not particularly pressing.

How about this variation following __alloc_page:

Index: linux-2.6.9/mm/hugetlb.c
===================================================================
--- linux-2.6.9.orig/mm/hugetlb.c 2004-10-21 20:39:50.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c 2004-10-22 10:53:18.000000000 -0700
@@ -32,14 +32,17 @@
{
int nid = numa_node_id();
struct page *page = NULL;
-
- if (list_empty(&hugepage_freelists[nid])) {
- for (nid = 0; nid < MAX_NUMNODES; ++nid)
- if (!list_empty(&hugepage_freelists[nid]))
- break;
+ struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;
+ struct zone **zones = zonelist->zones;
+ struct zone *z;
+ int i;
+
+ for(i=0; (z = zones[i])!= NULL; i++) {
+ nid = z->zone_pgdat->node_id;
+ if (list_empty(&hugepage_freelists[node_id]))
+ break;
}
- if (nid >= 0 && nid < MAX_NUMNODES &&
- !list_empty(&hugepage_freelists[nid])) {
+ if (z) {
page = list_entry(hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);

2004-10-22 19:48:55

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [4/4]: Numa patch

On Fri, 22 Oct 2004, William Lee Irwin III wrote:
>> dequeue_huge_page() seems to want a nodemask, not a vma, though I
>> suppose it's not particularly pressing.

On Fri, Oct 22, 2004 at 12:37:13PM -0700, Christoph Lameter wrote:
> How about this variation following __alloc_page:

Looks reasonable. The bit that struck me as quirky was the mpol_* on
the NULL vma. This pretty much eliminates the hidden dispatch, so I'm
happy.


-- wli

2004-10-22 20:36:31

by Christoph Lameter

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [2/4]: set_huge_pte() arch updates

On Fri, 22 Oct 2004, William Lee Irwin III wrote:

> update_mmu_cache() is stage (3). Linux' software pagetable
> modifications are step (2). Generally, architectures are trying to fold
> stages (2) and (3) together in set_huge_pte(), so update_mmu_cache() is
> not so much of an issue for them (and it's actually incorrect to do
> this multiple times or without the architectures' awareness of hugepages
> in update_mmu_cache() and so on). To completely regularize the hugetlb
> case, merely move the commitment to cpu-native structures or other TLB
> insertion done within set_huge_pte() to a new case in update_mmu_cache()
> for large pages.

Ok. I have done so in the following patch, but not for all archs yet. I
will work on that next week and then post V2 of the patch.

> What is in fact far more pressing is flush_dcache_page(), without a
> correct implementation of which for hugetlb, user-visible data
> corruption follows.

When is flush_dcache_page used on a huge page?

It seems that i386 simply does nothing for flush_dcache_page. IA64
defers to update_mmu_cache by setting PG_arch_1. So these two are ok as
is.

The other archs are likely much more involved than this. I tried to find
a simple way to identify a page as a huge page via struct page, but
that is not that easy, and it's likely not good to follow
pointers to pointers in such a critical function. Maybe we need to add a
new page flag?
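
A hedged sketch of the page-flag idea (PG_hugepage and
arch_flush_dcache_range() are hypothetical names, not 2.6.9 symbols):

void flush_dcache_page(struct page *page)
{
	unsigned long kaddr = (unsigned long) page_address(page);

	if (test_bit(PG_hugepage, &page->flags))
		/* the mapping spans HPAGE_SIZE, so flush the whole region */
		arch_flush_dcache_range(kaddr, kaddr + HPAGE_SIZE);
	else
		arch_flush_dcache_range(kaddr, kaddr + PAGE_SIZE);
}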

Index: linux-2.6.9/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/mm/init.c 2004-10-21 12:01:21.000000000 -0700
+++ linux-2.6.9/arch/ia64/mm/init.c 2004-10-22 13:07:36.000000000 -0700
@@ -95,6 +95,26 @@
set_bit(PG_arch_1, &page->flags); /* mark page as clean */
}

+void
+huge_update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte)
+{
+ unsigned long addr;
+ struct page *page;
+
+ if (!pte_exec(pte))
+ return; /* not an executable page... */
+
+ page = pte_page(pte);
+ /* don't use VADDR: it may not be mapped on this CPU (or may have just been flushed): */
+ addr = (unsigned long) page_address(page);
+
+ if (test_bit(PG_arch_1, &page->flags))
+ return; /* i-cache is already coherent with d-cache */
+
+ flush_icache_range(addr, addr + HPAGE_SIZE);
+ set_bit(PG_arch_1, &page->flags); /* mark page as clean */
+}
+
inline void
ia64_set_rbs_bot (void)
{
Index: linux-2.6.9/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgtable.h 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgtable.h 2004-10-22 13:23:55.000000000 -0700
@@ -482,6 +482,7 @@
* flushing that may be necessary.
*/
extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
+extern void huge_update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);

#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
Index: linux-2.6.9/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable.h 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgtable.h 2004-10-22 13:09:46.000000000 -0700
@@ -389,6 +389,7 @@
* bit at the same time.
*/
#define update_mmu_cache(vma,address,pte) do { } while (0)
+#define huge_update_mmu_cache(vma,address,pte) do { } while (0)
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
do { \
Index: linux-2.6.9/include/asm-sparc64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-sparc64/pgtable.h 2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-sparc64/pgtable.h 2004-10-22 13:13:40.000000000 -0700
@@ -347,6 +347,7 @@

struct vm_area_struct;
extern void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t);
+extern void huge_update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t);

/* Make a non-present pseudo-TTE. */
static inline pte_t mk_pte_io(unsigned long page, pgprot_t prot, int space)
Index: linux-2.6.9/include/asm-sh64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh64/pgtable.h 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/include/asm-sh64/pgtable.h 2004-10-22 13:11:52.000000000 -0700
@@ -462,6 +462,7 @@

extern void update_mmu_cache(struct vm_area_struct * vma,
unsigned long address, pte_t pte);
+#define huge_update_mmu_cache update_mmu_cache

/* Encode and decode a swap entry */
#define __swp_type(x) (((x).val & 3) + (((x).val >> 1) & 0x3c))

2004-10-22 20:51:10

by Christoph Lameter

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [2/4]: set_huge_pte() arch updates

On Fri, 22 Oct 2004, William Lee Irwin III wrote:

> It's not done at all for hugepages now, and needs to be. Fault handling
> on hugetlb vmas will likely expose the caching of stale data more readily.

Hmm... Looks like there is a long way ahead of us.

2004-10-22 20:46:45

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [2/4]: set_huge_pte() arch updates

On Fri, 22 Oct 2004, William Lee Irwin III wrote:
>> What is in fact far more pressing is flush_dcache_page(), without a
>> correct implementation of which for hugetlb, user-visible data
>> corruption follows.

On Fri, Oct 22, 2004 at 01:29:24PM -0700, Christoph Lameter wrote:
> When is flush_dcache_page used on a huge page?
> It seems that i386 simply does nothing for flush_dcache_page. IA64
> defers to update_mmu_cache by setting PG_arch_1. So these two are ok as
> is.
> The other archs are likely much more involved than this. I tried to find
> a simple way to identify a page as a huge page via struct page, but
> that is not that easy, and it's likely not good to follow
> pointers to pointers in such a critical function. Maybe we need to add a
> new page flag?

It's not done at all for hugepages now, and needs to be. Fault handling
on hugetlb vmas will likely expose the caching of stale data more readily.


-- wli

2004-10-25 21:12:08

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Hugepages demand paging V1 [4/4]: Numa patch

Christoph Lameter wrote on Friday, October 22, 2004 12:37 PM
> > On Thu, Oct 21, 2004 at 09:58:54PM -0700, Christoph Lameter wrote:
> > > Changelog
> > > * NUMA enhancements (rough first implementation)
> > > * Do not begin search for huge page memory at the first node
> > > but start at the current node and then search previous and
> > > the following nodes for memory.
> > > Signed-off-by: Christoph Lameter <[email protected]>
> >
> > dequeue_huge_page() seems to want a nodemask, not a vma, though I
> > suppose it's not particularly pressing.
>
> How about this variation following __alloc_page:
>
> @@ -32,14 +32,17 @@
> + struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;
> + struct zone **zones = zonelist->zones;
> + struct zone *z;
> + int i;
> +
> + for(i=0; (z = zones[i])!= NULL; i++) {
> + nid = z->zone_pgdat->node_id;
> + if (list_empty(&hugepage_freelists[node_id]))
> + break;
> }

Must be typos in the if statement. Two fatal errors here: you don't
really mean to break out of the for loop if there are no hugetlb pages
on that node, do you? And the variable name used to index into the
freelist is wrong; it should be nid, otherwise this code won't compile.
That line should be this:

+ if (!list_empty(&hugepage_freelists[nid]))
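
With both typos fixed, the scan would read roughly as below (a sketch
of the correction being asked for here, not the posted patch):

	for (i = 0; (z = zones[i]) != NULL; i++) {
		nid = z->zone_pgdat->node_id;
		/* stop at the first node that still has a free huge page */
		if (!list_empty(&hugepage_freelists[nid]))
			break;
	}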


Also, this is generic code; we should consider scanning the ZONE_HIGHMEM
zonelist. Otherwise, this will likely screw up an x86 NUMA machine.

- Ken


2004-10-25 21:29:13

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Hugepages demand paging V1 [4/4]: Numa patch

On Fri, 22 Oct 2004, William Lee Irwin III wrote:
>> dequeue_huge_page() seems to want a nodemask, not a vma, though I
>> suppose it's not particularly pressing.

On Fri, Oct 22, 2004 at 12:37:13PM -0700, Christoph Lameter wrote:
> How about this variation following __alloc_page:

William Lee Irwin III wrote on Friday, October 22, 2004 12:41 PM
> Looks reasonable. The bit that struck me as quirky was the mpol_* on
> the NULL vma. This pretty much eliminates the hidden dispatch, so I'm
> happy.

Allocating from the next best node is orthogonal to hugetlb demand paging.
This should be merged once all the bugs are fixed, and later, when demand
paging goes in, we can add the mpol_* stuff.

- Ken


2004-10-25 21:59:43

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [4/4]: Numa patch

On Fri, Oct 22, 2004 at 12:37:13PM -0700, Christoph Lameter wrote:
>> How about this variation following __alloc_page:

William Lee Irwin III wrote on Friday, October 22, 2004 12:41 PM
>> Looks reasonable. The bit that struck me as quirky was the mpol_* on
>> the NULL vma. This pretty much eliminates the hidden dispatch, so I'm
>> happy.

On Mon, Oct 25, 2004 at 02:25:09PM -0700, Chen, Kenneth W wrote:
> The allocate from next best node is orthogonal to hugetlb demand paging.
> This should be merged once all the bugs are fixed and later when demand
> paging goes in, we can add the mpol_* stuff.

I'm not too picky about this. It appears to be the 4th of the series,
so assuming they go in in order that should meet your expectations. I
am significantly more concerned about the flush_dcache_page() issue in
general, though. I guess this should light a fire under my backside to
dredge up the docs describing the proper TLB flushing methods to use
in conjunction with large page extensions for the affected arches.


-- wli

2004-10-25 22:08:49

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V1 [4/4]: Numa patch

On Mon, Oct 25, 2004 at 02:25:09PM -0700, Chen, Kenneth W wrote:
>> The allocate from next best node is orthogonal to hugetlb demand paging.
>> This should be merged once all the bugs are fixed and later when demand
>> paging goes in, we can add the mpol_* stuff.

On Mon, Oct 25, 2004 at 02:52:19PM -0700, William Lee Irwin III wrote:
> I'm not too picky about this. It appears to be the 4th of the series,
> so assuming they go in in order that should meet your expectations. I
> am significantly more concerned about the flush_dcache_page() issue in
> general, though. I guess this should light a fire under my backside to
> dredge up the docs describing the proper TLB flushing methods to use
> in conjunction with large page extensions for the affected arches.

Cache flushing methods.


-- wli

2004-10-27 18:04:35

by Christoph Lameter

[permalink] [raw]
Subject: RE: Hugepages demand paging V1 [4/4]: Numa patch

On Mon, 25 Oct 2004, Chen, Kenneth W wrote:

> > @@ -32,14 +32,17 @@
> > + struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;
> > + struct zone **zones = zonelist->zones;
> > + struct zone *z;
> > + int i;
> > +
> > + for(i=0; (z = zones[i])!= NULL; i++) {
> > + nid = z->zone_pgdat->node_id;
> > + if (list_empty(&hugepage_freelists[node_id]))
> > + break;
> > }
>
> Also, this is generic code; we should consider scanning the ZONE_HIGHMEM
> zonelist. Otherwise, this will likely screw up an x86 NUMA machine.

The highmem zones are included in the zones[] array AFAIK.

2004-10-27 20:58:17

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: Hugepages demand paging V1 [4/4]: Numa patch

On Mon, 25 Oct 2004, Chen, Kenneth W wrote:
> > @@ -32,14 +32,17 @@
> > + struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;
> > + struct zone **zones = zonelist->zones;
> > + struct zone *z;
> > + int i;
> > +
> > + for(i=0; (z = zones[i])!= NULL; i++) {
> > + nid = z->zone_pgdat->node_id;
> > + if (list_empty(&hugepage_freelists[node_id]))
> > + break;
> > }
>
> Also, this is generic code; we should consider scanning the ZONE_HIGHMEM
> zonelist. Otherwise, this will likely screw up an x86 NUMA machine.

Christoph Lameter wrote on Wednesday, October 27, 2004 10:57 AM
> The highmem zones are included in the zones[] array AFAIK.


node_zonelists is an array in struct pglist_data. In your patch,
you are referencing the first element of that array, which holds the
zone list for all node memory in the normal zone.

What will happen on an x86 NUMA box with highmem only on some nodes?
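
A sketch of the adjustment being hinted at, following how 2.6.9's
alloc_pages_node() picks its zonelist (using GFP_HIGHUSER here is an
assumption about what hugetlb wants, not code from the thread):

	/* index node_zonelists[] by the zone huge pages may live in,
	   instead of always taking element 0 (the ZONE_NORMAL list) */
	struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists +
					(GFP_HIGHUSER & GFP_ZONEMASK);
	struct zone **zones = zonelist->zones;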

- Ken