2004-10-26 01:59:31

by Christoph Lameter

Subject: Hugepages demand paging V2 [0/8]: Discussion and overview

Changes from V1:
- support huge pages in flush_dcache_page on various architectures
- revised simple numa allocation
- do not include update_mmu_cache in set_huge_pte. Require huge_update_mmu_cache

This is a revised edition of the hugetlb demand page patches by
Kenneth Chen which were discussed in the following thread in August 2004

http://marc.theaimsgroup.com/?t=109171285000004&r=1&w=2

The initial post by Ken was in April in

http://marc.theaimsgroup.com/?l=linux-ia64&m=108189860401704&w=2

Hugetlb demand paging has been part of SuSE SLES 9 for a while now, and this
patchset is intended to help hugetlb demand paging also get into the official
Linux kernel. In terms of "struct page", huge pages are represented as "compound"
pages in the Linux kernel, so the term "compound page" may be used
interchangeably with huge page below.
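
The arch-specific patches below repeatedly use the same idiom to walk all base
pages of a huge page. As a reading aid, here is a minimal sketch of that idiom,
assuming the 2.6.9 compound page layout (every sub-page's ->private points back
to the head page and page[1].index holds the allocation order); the helper names
are illustrative only, not existing kernel API:

	#include <linux/mm.h>

	/* Head page of the compound page that contains @page. */
	static inline struct page *compound_head_page(struct page *page)
	{
		return (struct page *) page->private;
	}

	/* Number of base pages making up the compound page headed by @head. */
	static inline int compound_nr_pages(struct page *head)
	{
		return 1 << head[1].index;	/* order stored when the page was set up */
	}

	/* Apply a per-base-page operation to a whole huge page. */
	static void for_each_base_page(struct page *page, void (*op)(struct page *))
	{
		int nr;

		if (!PageCompound(page)) {
			op(page);
			return;
		}
		page = compound_head_page(page);
		nr = compound_nr_pages(page);
		while (nr-- > 0)
			op(page++);
	}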

Note that this is only the second patchset and is meant as a basis for
discussion, not as a final patchset. Please review these patches. Contributions
are welcome, in particular for the sparc64, sh and sh64 architecture support,
since I do not have any of those platforms available to me.

The patchset consists of 8 patches.

1/8 Demand Paging patch. Ken's original work plus a fix that was posted later.

2/8 Avoid-overcommit patch: Also mostly the original work by Ken plus a fix that he
posted later.

3/8 Numa patch: Make the huge page allocator try to allocate local memory.

4/8 ia64 arch modifications

5/8 i386 arch modifications

6/8 sparc64 arch modifications (untested!)

7/8 sh64 arch modifications (untested!)

8/8 sh arch modifications (untested!)

Open issues:
- memory policy for NUMA allocation is only available in mempolicy.c and not in
hugetlb.c. If hugepage allocation needs to follow the memory policy then
additional functionality in mempolicy.c needs to be exported (defer for now).

- Do other arch specific functions need to be aware of compound pages for
this to work?

- Clearing hugetlb pages with clear_highpage in alloc_huge_page is time
consuming. Could hardware assist via DMA or similar be used there?
(See the sketch after this list.)

- sparc64 arch code needs to be tested

- sh64 code needs to be fixed up and tested

- sh code needs to be fixed up and tested
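
Regarding the clearing cost mentioned above: alloc_huge_page() currently zeroes
the huge page one base page at a time, roughly like this (abridged sketch of the
2.6.9 code path):

	/* in alloc_huge_page(), after a free huge page has been dequeued */
	for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); ++i)
		clear_highpage(&page[i]);	/* one cache-cold clear per base page */

With demand paging this now happens at fault time; for example, with a 256 MB
huge page and 16 KB base pages that is 16384 clear_highpage() calls per fault,
which is what a DMA/BTE-style hardware assist could avoid.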


2004-10-26 01:59:30

by Christoph Lameter

Subject: Hugepages demand paging V2 [3/8]: simple numa compatible allocator

Changelog
* Simple NUMA compatible allocation of hugepages in the nearest node

Index: linux-2.6.9/mm/hugetlb.c
===================================================================
--- linux-2.6.9.orig/mm/hugetlb.c 2004-10-22 13:28:27.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c 2004-10-25 16:56:22.000000000 -0700
@@ -32,14 +32,17 @@
{
int nid = numa_node_id();
struct page *page = NULL;
-
- if (list_empty(&hugepage_freelists[nid])) {
- for (nid = 0; nid < MAX_NUMNODES; ++nid)
- if (!list_empty(&hugepage_freelists[nid]))
- break;
+ struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;
+ struct zone **zones = zonelist->zones;
+ struct zone *z;
+ int i;
+
+ for(i=0; (z = zones[i])!= NULL; i++) {
+ nid = z->zone_pgdat->node_id;
+ if (!list_empty(&hugepage_freelists[nid]))
+ break;
}
- if (nid >= 0 && nid < MAX_NUMNODES &&
- !list_empty(&hugepage_freelists[nid])) {
+ if (z) {
page = list_entry(hugepage_freelists[nid].next,
struct page, lru);
list_del(&page->lru);
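
Since the hunk is a bit hard to read in diff form, this is roughly how the
allocator looks with the patch applied (reconstructed from the hunk above; the
enclosing function name dequeue_huge_page() and the trailing return are taken
from 2.6.9 mm/hugetlb.c and are not shown in the hunk):

	static struct page *dequeue_huge_page(void)
	{
		int nid = numa_node_id();
		struct page *page = NULL;
		struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;
		struct zone **zones = zonelist->zones;
		struct zone *z;
		int i;

		/* Walk the local node's zonelist so the nearest node that
		 * still has a free huge page is preferred. */
		for (i = 0; (z = zones[i]) != NULL; i++) {
			nid = z->zone_pgdat->node_id;
			if (!list_empty(&hugepage_freelists[nid]))
				break;
		}
		if (z) {
			page = list_entry(hugepage_freelists[nid].next,
					  struct page, lru);
			list_del(&page->lru);
		}
		return page;
	}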

2004-10-26 02:04:06

by Christoph Lameter

Subject: Hugepages demand paging V2 [4/8]: ia64 arch modifications

Changelog
* Provide huge_update_mmu_cache that flushes a huge page from the icache if necessary
* flush_dcache_page does nothing on ia64 and thus does not need to be extended
* Built and tested

Index: linux-2.6.9/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/mm/init.c 2004-10-21 12:01:21.000000000 -0700
+++ linux-2.6.9/arch/ia64/mm/init.c 2004-10-25 15:32:35.000000000 -0700
@@ -95,6 +95,26 @@
set_bit(PG_arch_1, &page->flags); /* mark page as clean */
}

+void
+huge_update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte)
+{
+ unsigned long addr;
+ struct page *page;
+
+ if (!pte_exec(pte))
+ return; /* not an executable page... */
+
+ page = pte_page(pte);
+ /* don't use VADDR: it may not be mapped on this CPU (or may have just been flushed): */
+ addr = (unsigned long) page_address(page);
+
+ if (test_bit(PG_arch_1, &page->flags))
+ return; /* i-cache is already coherent with d-cache */
+
+ flush_icache_range(addr, addr + HPAGE_SIZE);
+ set_bit(PG_arch_1, &page->flags); /* mark page as clean */
+}
+
inline void
ia64_set_rbs_bot (void)
{
Index: linux-2.6.9/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgtable.h 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgtable.h 2004-10-25 15:31:39.000000000 -0700
@@ -481,7 +481,8 @@
* information. However, we use this routine to take care of any (delayed) i-cache
* flushing that may be necessary.
*/
-extern void update_mmu_cache (struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
+extern void update_mmu_cache(struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);
+extern void huge_update_mmu_cache(struct vm_area_struct *vma, unsigned long vaddr, pte_t pte);

#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*

2004-10-26 02:04:06

by Christoph Lameter

Subject: Hugepages demand paging V2 [2/8]: allocation control

Changelog
* hugetlb memory allocation control

Index: linux-2.6.9/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.9.orig/fs/hugetlbfs/inode.c 2004-10-21 14:50:14.000000000 -0700
+++ linux-2.6.9/fs/hugetlbfs/inode.c 2004-10-21 20:02:23.000000000 -0700
@@ -32,6 +32,206 @@
/* some random number */
#define HUGETLBFS_MAGIC 0x958458f6

+/* Convert loff_t and PAGE_SIZE counts to hugetlb page counts. */
+#define VMACCT(x) ((x) >> (HPAGE_SHIFT))
+#define VMACCTPG(x) ((x) >> (HPAGE_SHIFT - PAGE_SHIFT))
+
+static long hugetlbzone_resv;
+static spinlock_t hugetlbfs_lock = SPIN_LOCK_UNLOCKED;
+
+int hugetlb_acct_memory(long delta)
+{
+ int ret = 0;
+
+ spin_lock(&hugetlbfs_lock);
+ if (delta > 0 && (hugetlbzone_resv + delta) >
+ VMACCTPG(hugetlb_total_pages()))
+ ret = -ENOMEM;
+ else
+ hugetlbzone_resv += delta;
+ spin_unlock(&hugetlbfs_lock);
+ return ret;
+}
+
+struct file_region {
+ struct list_head link;
+ long from;
+ long to;
+};
+
+static int region_add(struct list_head *head, int f, int t)
+{
+ struct file_region *rg;
+ struct file_region *nrg;
+ struct file_region *trg;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* Add a new region if the existing region starts above our end.
+ * We should already have a space to record. */
+ if (&rg->link == head || t < rg->from)
+ BUG();
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (f > rg->from)
+ f = rg->from;
+
+ /* Check for and consume any regions we now overlap with. */
+ nrg = rg;
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > t)
+ break;
+
+ /* If this area reaches higher then extend our area to
+ * include it completely. If this is not the first area
+ * which we intend to reuse, free it. */
+ if (rg->to > t)
+ t = rg->to;
+ if (rg != nrg) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ }
+ nrg->from = f;
+ nrg->to = t;
+ return 0;
+}
+
+static int region_chg(struct list_head *head, int f, int t)
+{
+ struct file_region *rg;
+ struct file_region *nrg;
+ loff_t chg = 0;
+
+ /* Locate the region we are before or in. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* If we are below the current region then a new region is required.
+ * Subtle: allocate a new region at the position but make it zero
+ * size such that we can guarantee to record the reservation. */
+ if (&rg->link == head || t < rg->from) {
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ if (!nrg)
+ return -ENOMEM;
+ nrg->from = f;
+ nrg->to = f;
+ INIT_LIST_HEAD(&nrg->link);
+ list_add(&nrg->link, rg->link.prev);
+
+ return t - f;
+ }
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (f > rg->from)
+ f = rg->from;
+ chg = t - f;
+
+ /* Check for and consume any regions we now overlap with. */
+ list_for_each_entry(rg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > t)
+ return chg;
+
+ /* We overlap with this area; if it extends further than
+ * us then we must extend ourselves. Account for its
+ * existing reservation. */
+ if (rg->to > t) {
+ chg += rg->to - t;
+ t = rg->to;
+ }
+ chg -= rg->to - rg->from;
+ }
+ return chg;
+}
+
+static int region_truncate(struct list_head *head, int end)
+{
+ struct file_region *rg;
+ struct file_region *trg;
+ int chg = 0;
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (end <= rg->to)
+ break;
+ if (&rg->link == head)
+ return 0;
+
+ /* If we are in the middle of a region then adjust it. */
+ if (end > rg->from) {
+ chg = rg->to - end;
+ rg->to = end;
+ rg = list_entry(rg->link.next, typeof(*rg), link);
+ }
+
+ /* Drop any remaining regions. */
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ chg += rg->to - rg->from;
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ return chg;
+}
+
+#if 0
+static int region_dump(struct list_head *head)
+{
+ struct file_region *rg;
+
+ list_for_each_entry(rg, head, link)
+ printk(KERN_WARNING "rg<%p> f<%lld> t<%lld>\n",
+ rg, rg->from, rg->to);
+ return 0;
+}
+#endif
+
+/* Calculate the commitment change that this mapping implies
+ * and check it against both the commitment and quota limits. */
+static int hugetlb_acct_commit(struct inode *inode, int from, int to)
+{
+ int chg;
+ int ret;
+
+ chg = region_chg(&inode->i_mapping->private_list, from, to);
+ if (chg < 0)
+ return chg;
+ ret = hugetlb_acct_memory(chg);
+ if (ret < 0)
+ return ret;
+ ret = hugetlb_get_quota(inode->i_mapping, chg);
+ if (ret < 0)
+ goto undo_commit;
+ ret = region_add(&inode->i_mapping->private_list, from, to);
+ return ret;
+
+undo_commit:
+ hugetlb_acct_memory(-chg);
+ return ret;
+}
+static void hugetlb_acct_release(struct inode *inode, int to)
+{
+ int chg;
+
+ chg = region_truncate(&inode->i_mapping->private_list, to);
+ hugetlb_acct_memory(-chg);
+ hugetlb_put_quota(inode->i_mapping, chg);
+}
+
+int hugetlbfs_report_meminfo(char *buf)
+{
+ return sprintf(buf, "HugePages_Reserved: %5ld\n", hugetlbzone_resv);
+}
+
static struct super_operations hugetlbfs_ops;
static struct address_space_operations hugetlbfs_aops;
struct file_operations hugetlbfs_file_operations;
@@ -48,7 +248,6 @@
static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_dentry->d_inode;
- struct address_space *mapping = inode->i_mapping;
loff_t len, vma_len;
int ret;

@@ -79,7 +278,10 @@
if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
goto out;

- if (inode->i_size < len)
+ ret = hugetlb_acct_commit(inode, VMACCTPG(vma->vm_pgoff),
+ VMACCTPG(vma->vm_pgoff + (vma_len >> PAGE_SHIFT)));
+
+ if (ret >= 0 && inode->i_size < len)
inode->i_size = len;
out:
up(&inode->i_sem);
@@ -194,7 +396,6 @@
++next;
truncate_huge_page(page);
unlock_page(page);
- hugetlb_put_quota(mapping);
}
huge_pagevec_release(&pvec);
}
@@ -213,6 +414,7 @@

if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);
+ hugetlb_acct_release(inode, 0);

security_inode_delete(inode);

@@ -254,6 +456,7 @@
spin_unlock(&inode_lock);
if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);
+ hugetlb_acct_release(inode, 0);

if (sbinfo->free_inodes >= 0) {
spin_lock(&sbinfo->stat_lock);
@@ -324,6 +527,7 @@
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
spin_unlock(&mapping->i_mmap_lock);
truncate_hugepages(mapping, offset);
+ hugetlb_acct_release(inode, VMACCT(offset));
return 0;
}

@@ -378,6 +582,7 @@
inode->i_blocks = 0;
inode->i_mapping->a_ops = &hugetlbfs_aops;
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
+ INIT_LIST_HEAD(&inode->i_mapping->private_list);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
info = HUGETLBFS_I(inode);
mpol_shared_policy_init(&info->policy);
@@ -669,15 +874,15 @@
return -ENOMEM;
}

-int hugetlb_get_quota(struct address_space *mapping)
+int hugetlb_get_quota(struct address_space *mapping, int blocks)
{
int ret = 0;
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

if (sbinfo->free_blocks > -1) {
spin_lock(&sbinfo->stat_lock);
- if (sbinfo->free_blocks > 0)
- sbinfo->free_blocks--;
+ if (sbinfo->free_blocks >= blocks)
+ sbinfo->free_blocks -= blocks;
else
ret = -ENOMEM;
spin_unlock(&sbinfo->stat_lock);
@@ -686,13 +891,13 @@
return ret;
}

-void hugetlb_put_quota(struct address_space *mapping)
+void hugetlb_put_quota(struct address_space *mapping, int blocks)
{
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

if (sbinfo->free_blocks > -1) {
spin_lock(&sbinfo->stat_lock);
- sbinfo->free_blocks++;
+ sbinfo->free_blocks += blocks;
spin_unlock(&sbinfo->stat_lock);
}
}
@@ -745,9 +950,6 @@
if (!can_do_hugetlb_shm())
return ERR_PTR(-EPERM);

- if (!is_hugepage_mem_enough(size))
- return ERR_PTR(-ENOMEM);
-
if (!user_shm_lock(size, current->user))
return ERR_PTR(-ENOMEM);

@@ -779,6 +981,14 @@
file->f_mapping = inode->i_mapping;
file->f_op = &hugetlbfs_file_operations;
file->f_mode = FMODE_WRITE | FMODE_READ;
+
+ /* Account for the memory usage for this segment at create time.
+ * This maintains the commit on shmget() semantics of normal
+ * shared memory segments. */
+ error = hugetlb_acct_commit(inode, 0, VMACCT(size));
+ if (error < 0)
+ goto out_file;
+
return file;

out_file:
Index: linux-2.6.9/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.9.orig/fs/proc/proc_misc.c 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/fs/proc/proc_misc.c 2004-10-21 20:01:09.000000000 -0700
@@ -235,6 +235,7 @@
vmi.largest_chunk
);

+ len += hugetlbfs_report_meminfo(page + len);
len += hugetlb_report_meminfo(page + len);

return proc_calc_metrics(page, start, off, count, eof, len);
Index: linux-2.6.9/include/linux/hugetlb.h
===================================================================
--- linux-2.6.9.orig/include/linux/hugetlb.h 2004-10-21 14:50:14.000000000 -0700
+++ linux-2.6.9/include/linux/hugetlb.h 2004-10-21 20:01:09.000000000 -0700
@@ -122,8 +122,8 @@
extern struct file_operations hugetlbfs_file_operations;
extern struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_zero_setup(size_t);
-int hugetlb_get_quota(struct address_space *mapping);
-void hugetlb_put_quota(struct address_space *mapping);
+int hugetlb_get_quota(struct address_space *mapping, int blocks);
+void hugetlb_put_quota(struct address_space *mapping, int blocks);

static inline int is_file_hugepages(struct file *file)
{
@@ -134,11 +134,14 @@
{
file->f_op = &hugetlbfs_file_operations;
}
+int hugetlbfs_report_meminfo(char *);
+
#else /* !CONFIG_HUGETLBFS */

#define is_file_hugepages(file) 0
#define set_file_hugepages(file) BUG()
#define hugetlb_zero_setup(size) ERR_PTR(-ENOSYS)
+#define hugetlbfs_report_meminfo(buf) 0

#endif /* !CONFIG_HUGETLBFS */
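
To illustrate what the file_region reservation map added above buys us, consider
two overlapping mappings of the same hugetlbfs file (a hypothetical sequence;
units are huge pages, head is the inode's mapping->private_list):

	/* First mmap covers hugepages [0,4) of the file. */
	region_chg(head, 0, 4);		/* returns 4: four new pages to reserve */
	region_add(head, 0, 4);		/* map now records [0,4) */

	/* Second mmap covers hugepages [2,6); [2,4) is already reserved. */
	region_chg(head, 2, 6);		/* returns 2: only [4,6) is new */
	region_add(head, 2, 6);		/* existing region is extended to [0,6) */

Only the delta returned by region_chg() is passed to hugetlb_acct_memory() and
hugetlb_get_quota(), so overlapping mappings of the same file are accounted and
charged only once.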


2004-10-26 02:04:05

by Christoph Lameter

Subject: Hugepages demand paging V2 [6/8]: sparc64 arch modifications

Changelog
* Extend update_mmu_cache to handle compound pages
* Extend flush_dcache_page to handle compound pages
* Not built and not tested

Index: linux-2.6.9/include/asm-sparc64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-sparc64/pgtable.h 2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-sparc64/pgtable.h 2004-10-25 16:56:34.000000000 -0700
@@ -347,6 +347,7 @@

struct vm_area_struct;
extern void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t);
+#define huge_update_mmu_cache update_mmu_cache

/* Make a non-present pseudo-TTE. */
static inline pte_t mk_pte_io(unsigned long page, pgprot_t prot, int space)
Index: linux-2.6.9/arch/sparc64/mm/init.c
===================================================================
--- linux-2.6.9.orig/arch/sparc64/mm/init.c 2004-10-18 14:53:50.000000000 -0700
+++ linux-2.6.9/arch/sparc64/mm/init.c 2004-10-25 17:32:43.000000000 -0700
@@ -192,6 +192,30 @@
: "g5", "g7");
}

+static void flush_dcache_pages(struct page *page) {
+ int nr;
+
+ if (!PageCompound(page)) {
+ flush_dcache_page_impl(page);
+ return;
+ }
+
+ page = (struct page *)page->private;
+ nr = 1 << page[1].index;
+ while (nr-- > 0)
+ flush_dcache_page_impl(page++);
+}
+
+static void smp_flush_dcache_pages(struct page *page, int cpu) {
+ int nr;
+
+ if (!PageCompound(page)) {
+ smp_flush_dcache_page_impl(page, cpu);
+ return;
+ }
+
+ page = (struct page *)page->private;
+ nr = 1 << page[1].index;
+ while (nr-- > 0)
+ smp_flush_dcache_page_impl(page++, cpu);
+}
+
extern void __update_mmu_cache(unsigned long mmu_context_hw, unsigned long address, pte_t pte, int code);

void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t pte)
@@ -211,17 +235,32 @@
* in the SMP case.
*/
if (cpu == this_cpu)
- flush_dcache_page_impl(page);
+ flush_dcache_pages(page);
else
- smp_flush_dcache_page_impl(page, cpu);
+ smp_flush_dcache_pages(page, cpu);

clear_dcache_dirty_cpu(page, cpu);

put_cpu();
}
- if (get_thread_fault_code())
- __update_mmu_cache(vma->vm_mm->context & TAG_CONTEXT_BITS,
+ if (get_thread_fault_code()) {
+
+ if (PageCompound(page)) {
+ int nr;
+
+ page = (struct page *)page->private;
+ nr = 1 << page[1].index;
+ address = (unsigned long) page_address(page);
+ while (nr-- > 0) {
+ __update_mmu_cache(vma->vm_mm->context & TAG_CONTEXT_BITS,
+ address, pte, get_thread_fault_code());
+ address += PAGE_SIZE;
+ }
+ } else
+ __update_mmu_cache(vma->vm_mm->context & TAG_CONTEXT_BITS,
address, pte, get_thread_fault_code());
+
+ }
}

void flush_dcache_page(struct page *page)
@@ -235,9 +274,18 @@
if (dirty) {
if (dirty_cpu == this_cpu)
goto out;
- smp_flush_dcache_page_impl(page, dirty_cpu);
+ smp_flush_dcache_pages(page, dirty_cpu);
}
- set_dcache_dirty(page, this_cpu);
+
+ if (PageCompound(page)) {
+ int nr=1;
+
+ page = page->private;
+ nr = 1 << page[1].index;
+ while (nr-- >0)
+ set_dcache_dirty(page++, this_cpu);
+ } else
+ set_dcache_dirty(page, this_cpu);
} else {
/* We could delay the flush for the !page_mapping
* case too. But that case is for exec env/arg

2004-10-26 02:04:05

by Christoph Lameter

Subject: Hugepages demand paging V2 [5/8]: i386 arch modifications

Changelog
* Provide definition of huge_update_mmu_cache (i386 does nothing in
update_mmu_cache and flush_dcache_page)
* Built and tested

Index: linux-2.6.9/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-i386/pgtable.h 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/include/asm-i386/pgtable.h 2004-10-22 13:09:46.000000000 -0700
@@ -389,6 +389,7 @@
* bit at the same time.
*/
#define update_mmu_cache(vma,address,pte) do { } while (0)
+#define huge_update_mmu_cache(vma,address,pte) do { } while (0)
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
do { \


2004-10-26 02:04:02

by Christoph Lameter

Subject: Hugepages demand paging V2 [8/8]: sh arch specific modifications

Changelog
* Extend flush_dcache_page to flush compound pages.
* Attempt at a solution for sh3's and sh4's
huge_update_mmu_cache, which is likely very wrong.
* Not built and not tested. This is the architecture
I know the least about.

Index: linux-2.6.9/include/asm-sh/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh/pgtable.h 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/include/asm-sh/pgtable.h 2004-10-25 15:05:33.000000000 -0700
@@ -249,6 +249,8 @@
struct vm_area_struct;
extern void update_mmu_cache(struct vm_area_struct * vma,
unsigned long address, pte_t pte);
+extern void huge_update_mmu_cache(struct vm_area_struct * vma,
+ unsigned long address, pte_t pte);

/* Encode and de-code a swap entry */
/*
Index: linux-2.6.9/arch/sh/mm/cache-sh4.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/cache-sh4.c 2004-10-21 12:01:21.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/cache-sh4.c 2004-10-25 15:18:53.000000000 -0700
@@ -206,13 +206,25 @@
void flush_dcache_page(struct page *page)
{
if (test_bit(PG_mapped, &page->flags)) {
- unsigned long phys = PHYSADDR(page_address(page));
+ unsigned long phys;
+ int nr = 1;

- /* Loop all the D-cache */
- flush_cache_4096(CACHE_OC_ADDRESS_ARRAY, phys);
- flush_cache_4096(CACHE_OC_ADDRESS_ARRAY | 0x1000, phys);
- flush_cache_4096(CACHE_OC_ADDRESS_ARRAY | 0x2000, phys);
- flush_cache_4096(CACHE_OC_ADDRESS_ARRAY | 0x3000, phys);
+ if (PageCompound(page)) {
+ page = (struct page *)page->private;
+ nr = 1 << page[1].index;
+ }
+
+ phys = PHYSADDR(page_address(page));
+
+ while (nr-- > 0) {
+ /* Loop all the D-cache */
+ flush_cache_4096(CACHE_OC_ADDRESS_ARRAY, phys);
+ flush_cache_4096(CACHE_OC_ADDRESS_ARRAY | 0x1000, phys);
+ flush_cache_4096(CACHE_OC_ADDRESS_ARRAY | 0x2000, phys);
+ flush_cache_4096(CACHE_OC_ADDRESS_ARRAY | 0x3000, phys);
+
+ phys += PAGE_SIZE;
+ }
}
}

Index: linux-2.6.9/arch/sh/mm/tlb-nommu.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/tlb-nommu.c 2004-10-18 14:54:27.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/tlb-nommu.c 2004-10-25 15:06:56.000000000 -0700
@@ -55,4 +55,9 @@
{
BUG();
}
+void huge_update_mmu_cache(struct vm_area_struct * vma,
+ unsigned long address, pte_t pte)
+{
+ BUG();
+}

Index: linux-2.6.9/arch/sh/mm/cache-sh7705.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/cache-sh7705.c 2004-10-21 12:01:21.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/cache-sh7705.c 2004-10-25 15:23:46.000000000 -0700
@@ -135,8 +135,22 @@
*/
void flush_dcache_page(struct page *page)
{
- if (test_bit(PG_mapped, &page->flags))
- __flush_dcache_page(PHYSADDR(page_address(page)));
+ if (test_bit(PG_mapped, &page->flags)) {
+ if (!PageCompound(page))
+ __flush_dcache_page(PHYSADDR(page_address(page)));
+ else {
+ int nr;
+ unsigned long phys;
+
+ page = (struct page *)page->private;
+ nr = 1 << page[1].index;
+ phys = PHYSADDR(page_address(page));
+
+ while (nr-- > 0) {
+ __flush_dcache_page(phys);
+ phys += PAGE_SIZE;
+ }
+ }
+ }
}

void flush_cache_all(void)
Index: linux-2.6.9/arch/sh/mm/tlb-sh3.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/tlb-sh3.c 2004-10-21 12:01:21.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/tlb-sh3.c 2004-10-25 15:13:00.000000000 -0700
@@ -67,6 +67,48 @@
local_irq_restore(flags);
}

+void huge_update_mmu_cache(struct vm_area_struct * vma,
+ unsigned long address, pte_t pte)
+{
+ unsigned long flags;
+ unsigned long pteval;
+ unsigned long vpn;
+
+ /* Ptrace may call this routine. */
+ if (vma && current->active_mm != vma->vm_mm)
+ return;
+
+#if defined(CONFIG_SH7705_CACHE_32KB)
+ struct page *page;
+ page = pte_page(pte);
+ if (VALID_PAGE(page) && !test_bit(PG_mapped, &page->flags)) {
+ unsigned long phys = pte_val(pte) & PTE_PHYS_MASK;
+ __flush_wback_region((void *)P1SEGADDR(phys), HPAGE_SIZE);
+ __set_bit(PG_mapped, &page->flags);
+ }
+#endif
+
+ local_irq_save(flags);
+
+ /* FIXME: What exactly does the code below do? pte mapping ? */
+
+ /* Set PTEH register */
+ vpn = (address & MMU_VPN_MASK) | get_asid();
+ ctrl_outl(vpn, MMU_PTEH);
+
+ pteval = pte_val(pte);
+
+ /* Set PTEL register */
+ pteval &= _PAGE_FLAGS_HARDWARE_MASK; /* drop software flags */
+ /* conveniently, we want all the software flags to be 0 anyway */
+ ctrl_outl(pteval, MMU_PTEL);
+
+ /* Load the TLB */
+ asm volatile("ldtlb": /* no output */ : /* no input */ : "memory");
+ local_irq_restore(flags);
+}
+
+
void __flush_tlb_page(unsigned long asid, unsigned long page)
{
unsigned long addr, data;
Index: linux-2.6.9/arch/sh/mm/tlb-sh4.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/tlb-sh4.c 2004-10-18 14:54:38.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/tlb-sh4.c 2004-10-25 15:14:46.000000000 -0700
@@ -28,6 +28,56 @@
#include <asm/mmu_context.h>
#include <asm/cacheflush.h>

+void huge_update_mmu_cache(struct vm_area_struct * vma,
+ unsigned long address, pte_t pte)
+{
+ unsigned long flags;
+ unsigned long pteval;
+ unsigned long vpn;
+ struct page *page;
+ unsigned long pfn;
+ unsigned long ptea;
+
+ /* Ptrace may call this routine. */
+ if (vma && current->active_mm != vma->vm_mm)
+ return;
+
+ pfn = pte_pfn(pte);
+ if (pfn_valid(pfn)) {
+ page = pfn_to_page(pfn);
+ if (!test_bit(PG_mapped, &page->flags)) {
+ unsigned long phys = pte_val(pte) & PTE_PHYS_MASK;
+ __flush_wback_region((void *)P1SEGADDR(phys), HPAGE_SIZE);
+ __set_bit(PG_mapped, &page->flags);
+ }
+ }
+
+ /* FIXME: code below likely needs to be fixed up for huge pages */
+ local_irq_save(flags);
+
+ /* Set PTEH register */
+ vpn = (address & MMU_VPN_MASK) | get_asid();
+ ctrl_outl(vpn, MMU_PTEH);
+
+ pteval = pte_val(pte);
+ /* Set PTEA register */
+ /* TODO: make this look less hacky */
+ ptea = ((pteval >> 28) & 0xe) | (pteval & 0x1);
+ ctrl_outl(ptea, MMU_PTEA);
+
+ /* Set PTEL register */
+ pteval &= _PAGE_FLAGS_HARDWARE_MASK; /* drop software flags */
+#ifdef CONFIG_SH_WRITETHROUGH
+ pteval |= _PAGE_WT;
+#endif
+ /* conveniently, we want all the software flags to be 0 anyway */
+ ctrl_outl(pteval, MMU_PTEL);
+
+ /* Load the TLB */
+ asm volatile("ldtlb": /* no output */ : /* no input */ : "memory");
+ local_irq_restore(flags);
+}
+
void update_mmu_cache(struct vm_area_struct * vma,
unsigned long address, pte_t pte)
{

2004-10-26 03:20:20

by Jesse Barnes

Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Monday, October 25, 2004 7:23 pm, William Lee Irwin III wrote:
> On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
> > - Clearing hugetlb pages is time consuming using clear_highpage in
> > alloc_huge_page. Make it possible to use hw assist via DMA or so there?
>
> It's possible, but it's been found not to be useful. What has been found
> useful is assistance from much lower-level memory hardware of a kind
> not to be had in any extant mass-manufactured machines.

Do you have examples? SGI hardware has a so-called 'BTE' (for Block Transfer
Engine) that can arbitrarily zero or copy pages w/o CPU assistance. It's
built into the memory controller. Using it to zero the pages has the
advantages of being asynchronous and not hosing the CPU cache.

Jesse

2004-10-26 02:56:12

by William Lee Irwin III

Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
>>> - Clearing hugetlb pages is time consuming using clear_highpage in
>>> alloc_huge_page. Make it possible to use hw assist via DMA or so there?

On Monday, October 25, 2004 7:23 pm, William Lee Irwin III wrote:
>> It's possible, but it's been found not to be useful. What has been found
>> useful is assistance from much lower-level memory hardware of a kind
>> not to be had in any extant mass-manufactured machines.

On Mon, Oct 25, 2004 at 07:40:30PM -0700, Jesse Barnes wrote:
> Do you have examples? SGI hardware has a so-called 'BTE' (for Block Transfer
> Engine) that can arbitrarily zero or copy pages w/o CPU assistance. It's
> builtin to the memory controller. Using it to zero the pages has the
> advantages of being asyncrhonous and not hosing the CPU cache.

That's the same kind of thing, so it apparently has been
mass-manufactured.


-- wli

2004-10-26 02:40:11

by Christoph Lameter

Subject: Hugepages demand paging V2 [1/8]: hugetlb fault handler

ChangeLog
* provide huge page fault handler and related things

Index: linux-2.6.9/arch/i386/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/i386/mm/hugetlbpage.c 2004-10-21 12:01:21.000000000 -0700
+++ linux-2.6.9/arch/i386/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
@@ -18,13 +18,26 @@
#include <asm/tlb.h>
#include <asm/tlbflush.h>

-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+static void scrub_one_pmd(pmd_t * pmd)
+{
+ struct page *page;
+
+ if (pmd && !pmd_none(*pmd) && !pmd_huge(*pmd)) {
+ page = pmd_page(*pmd);
+ pmd_clear(pmd);
+ dec_page_state(nr_page_table_pages);
+ page_cache_release(page);
+ }
+}
+
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pmd_t *pmd = NULL;

pgd = pgd_offset(mm, addr);
pmd = pmd_alloc(mm, pgd, addr);
+ scrub_one_pmd(pmd);
return (pte_t *) pmd;
}

@@ -34,11 +47,14 @@
pmd_t *pmd = NULL;

pgd = pgd_offset(mm, addr);
- pmd = pmd_offset(pgd, addr);
+ if (pgd_present(*pgd)) {
+ pmd = pmd_offset(pgd, addr);
+ scrub_one_pmd(pmd);
+ }
return (pte_t *) pmd;
}

-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page, pte_t * page_table, int write_access)
+void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page, pte_t * page_table, int write_access)
{
pte_t entry;

@@ -73,17 +89,18 @@
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;

- while (addr < end) {
+ for (; addr < end; addr+= HPAGE_SIZE) {
+ src_pte = huge_pte_offset(src, addr);
+ if (!src_pte || pte_none(*src_pte))
+ continue;
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
- src_pte = huge_pte_offset(src, addr);
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
set_pte(dst_pte, entry);
dst->rss += (HPAGE_SIZE / PAGE_SIZE);
- addr += HPAGE_SIZE;
}
return 0;

@@ -217,68 +234,8 @@
continue;
page = pte_page(pte);
put_page(page);
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
}

-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- BUG_ON(vma->vm_start & ~HPAGE_MASK);
- BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
-
- if (!pte_none(*pte)) {
- pmd_t *pmd = (pmd_t *) pte;
-
- page = pmd_page(*pmd);
- pmd_clear(pmd);
- mm->nr_ptes--;
- dec_page_state(nr_page_table_pages);
- page_cache_release(page);
- }
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
- }
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
-}
Index: linux-2.6.9/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/ia64/mm/hugetlbpage.c 2004-10-18 14:54:27.000000000 -0700
+++ linux-2.6.9/arch/ia64/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
@@ -24,7 +24,7 @@

unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;

-static pte_t *
+pte_t *
huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
{
unsigned long taddr = htlbpage_to_page(addr);
@@ -59,7 +59,7 @@

#define mk_pte_huge(entry) { pte_val(entry) |= _PAGE_P; }

-static void
+void
set_huge_pte (struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page, pte_t * page_table, int write_access)
{
@@ -99,17 +99,18 @@
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;

- while (addr < end) {
+ for (; addr < end; addr += HPAGE_SIZE) {
+ src_pte = huge_pte_offset(src, addr);
+ if (!src_pte || pte_none(*src_pte))
+ continue;
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
- src_pte = huge_pte_offset(src, addr);
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
set_pte(dst_pte, entry);
dst->rss += (HPAGE_SIZE / PAGE_SIZE);
- addr += HPAGE_SIZE;
}
return 0;
nomem:
@@ -243,69 +244,16 @@

for (address = start; address < end; address += HPAGE_SIZE) {
pte = huge_pte_offset(mm, address);
- if (pte_none(*pte))
+ if (!pte || pte_none(*pte))
continue;
page = pte_page(*pte);
put_page(page);
pte_clear(pte);
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
}

-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- BUG_ON(vma->vm_start & ~HPAGE_MASK);
- BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
- if (!pte_none(*pte))
- continue;
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- page_cache_release(page);
- goto out;
- }
- }
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
-}
-
unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
{
Index: linux-2.6.9/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/ppc64/mm/hugetlbpage.c 2004-10-21 12:01:21.000000000 -0700
+++ linux-2.6.9/arch/ppc64/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
@@ -408,66 +408,9 @@
pte, local);

put_page(page);
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
put_cpu();
-
- mm->rss -= (end - start) >> PAGE_SHIFT;
-}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- WARN_ON(!is_vm_hugetlb_page(vma));
- BUG_ON((vma->vm_start % HPAGE_SIZE) != 0);
- BUG_ON((vma->vm_end % HPAGE_SIZE) != 0);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- hugepte_t *pte = hugepte_alloc(mm, addr);
- struct page *page;
-
- BUG_ON(!in_hugepage_area(mm->context, addr));
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
- if (!hugepte_none(*pte))
- continue;
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
- }
- setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
}

/* Because we have an exclusive hugepage region which lies within the
@@ -863,3 +806,59 @@

ppc_md.hpte_invalidate(slot, va, 1, local);
}
+
+int
+handle_hugetlb_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+ unsigned long addr, int write_access)
+{
+ hugepte_t *pte;
+ struct page *page;
+ struct address_space *mapping;
+ int idx, ret;
+
+ spin_lock(&mm->page_table_lock);
+ pte = hugepte_alloc(mm, addr & HPAGE_MASK);
+ if (!pte)
+ goto oom;
+ if (!hugepte_none(*pte))
+ goto out;
+ spin_unlock(&mm->page_table_lock);
+
+ mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+ idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
+ + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+retry:
+ page = find_get_page(mapping, idx);
+ if (!page) {
+ page = alloc_huge_page();
+ if (!page)
+ /*
+ * with strict overcommit accounting, we should never
+ * run out of hugetlb pages, so this must be a fault
+ * race; retry.
+ */
+ goto retry;
+ ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+ if (!ret) {
+ unlock_page(page);
+ } else {
+ put_page(page);
+ if (ret == -EEXIST)
+ goto retry;
+ else
+ return VM_FAULT_OOM;
+ }
+ }
+
+ spin_lock(&mm->page_table_lock);
+ if (hugepte_none(*pte))
+ setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE);
+ else
+ put_page(page);
+out:
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_MINOR;
+oom:
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_OOM;
+}
Index: linux-2.6.9/arch/sh/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/sh/mm/hugetlbpage.c 2004-10-18 14:54:32.000000000 -0700
+++ linux-2.6.9/arch/sh/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
@@ -24,7 +24,7 @@
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>

-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pmd_t *pmd;
@@ -56,7 +56,7 @@

#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)

-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page, pte_t * page_table, int write_access)
{
unsigned long i;
@@ -101,12 +101,13 @@
unsigned long end = vma->vm_end;
int i;

- while (addr < end) {
+ for (; addr < end; addr += HPAGE_SIZE) {
+ src_pte = huge_pte_offset(src, addr);
+ if (!src_pte || pte_none(*src_pte))
+ continue;
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
- src_pte = huge_pte_offset(src, addr);
- BUG_ON(!src_pte || pte_none(*src_pte));
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
@@ -116,7 +117,6 @@
dst_pte++;
}
dst->rss += (HPAGE_SIZE / PAGE_SIZE);
- addr += HPAGE_SIZE;
}
return 0;

@@ -196,8 +196,7 @@

for (address = start; address < end; address += HPAGE_SIZE) {
pte = huge_pte_offset(mm, address);
- BUG_ON(!pte);
- if (pte_none(*pte))
+ if (!pte || pte_none(*pte))
continue;
page = pte_page(*pte);
put_page(page);
@@ -205,60 +204,7 @@
pte_clear(pte);
pte++;
}
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- BUG_ON(vma->vm_start & ~HPAGE_MASK);
- BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
- if (!pte_none(*pte))
- continue;
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
- }
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
-}
Index: linux-2.6.9/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.9.orig/arch/sparc64/mm/hugetlbpage.c 2004-10-18 14:54:38.000000000 -0700
+++ linux-2.6.9/arch/sparc64/mm/hugetlbpage.c 2004-10-21 20:02:52.000000000 -0700
@@ -21,7 +21,7 @@
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>

-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pmd_t *pmd;
@@ -53,7 +53,7 @@

#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)

-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page, pte_t * page_table, int write_access)
{
unsigned long i;
@@ -98,12 +98,13 @@
unsigned long end = vma->vm_end;
int i;

- while (addr < end) {
+ for (; addr < end; addr += HPAGE_SIZE) {
+ src_pte = huge_pte_offset(src, addr);
+ if (!src_pte || pte_none(*src_pte))
+ continue;
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
- src_pte = huge_pte_offset(src, addr);
- BUG_ON(!src_pte || pte_none(*src_pte));
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
@@ -113,7 +114,6 @@
dst_pte++;
}
dst->rss += (HPAGE_SIZE / PAGE_SIZE);
- addr += HPAGE_SIZE;
}
return 0;

@@ -193,8 +193,7 @@

for (address = start; address < end; address += HPAGE_SIZE) {
pte = huge_pte_offset(mm, address);
- BUG_ON(!pte);
- if (pte_none(*pte))
+ if (!pte || pte_none(*pte))
continue;
page = pte_page(*pte);
put_page(page);
@@ -202,60 +201,7 @@
pte_clear(pte);
pte++;
}
+ mm->rss -= (HPAGE_SIZE / PAGE_SIZE);
}
- mm->rss -= (end - start) >> PAGE_SHIFT;
flush_tlb_range(vma, start, end);
}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- BUG_ON(vma->vm_start & ~HPAGE_MASK);
- BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
- if (!pte_none(*pte))
- continue;
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
- }
- set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
-}
Index: linux-2.6.9/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.9.orig/fs/hugetlbfs/inode.c 2004-10-18 14:55:07.000000000 -0700
+++ linux-2.6.9/fs/hugetlbfs/inode.c 2004-10-21 14:50:14.000000000 -0700
@@ -79,10 +79,6 @@
if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
goto out;

- ret = hugetlb_prefault(mapping, vma);
- if (ret)
- goto out;
-
if (inode->i_size < len)
inode->i_size = len;
out:
Index: linux-2.6.9/include/linux/hugetlb.h
===================================================================
--- linux-2.6.9.orig/include/linux/hugetlb.h 2004-10-18 14:54:08.000000000 -0700
+++ linux-2.6.9/include/linux/hugetlb.h 2004-10-21 14:50:14.000000000 -0700
@@ -17,7 +17,10 @@
int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
void zap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
-int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
+pte_t *huge_pte_alloc(struct mm_struct *, unsigned long);
+void set_huge_pte(struct mm_struct *, struct vm_area_struct *, struct page *, pte_t *, int);
+int handle_hugetlb_mm_fault(struct mm_struct *, struct vm_area_struct *, unsigned long, int);
+
int hugetlb_report_meminfo(char *);
int hugetlb_report_node_meminfo(int, char *);
int is_hugepage_mem_enough(size_t);
@@ -61,7 +64,7 @@
#define follow_hugetlb_page(m,v,p,vs,a,b,i) ({ BUG(); 0; })
#define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL)
#define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
-#define hugetlb_prefault(mapping, vma) ({ BUG(); 0; })
+#define handle_hugetlb_mm_fault(mm, vma, addr, write) VM_FAULT_SIGBUS
#define zap_hugepage_range(vma, start, len) BUG()
#define unmap_hugepage_range(vma, start, end) BUG()
#define is_hugepage_mem_enough(size) 0
Index: linux-2.6.9/mm/hugetlb.c
===================================================================
--- linux-2.6.9.orig/mm/hugetlb.c 2004-10-18 14:54:37.000000000 -0700
+++ linux-2.6.9/mm/hugetlb.c 2004-10-22 13:28:27.000000000 -0700
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/hugetlb.h>
+#include <linux/pagemap.h>
#include <linux/sysctl.h>
#include <linux/highmem.h>

@@ -231,11 +232,66 @@
}
EXPORT_SYMBOL(hugetlb_total_pages);

+int __attribute__ ((weak))
+handle_hugetlb_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+ unsigned long addr, int write_access)
+{
+ pte_t *pte;
+ struct page *page;
+ struct address_space *mapping;
+ int idx, ret;
+
+ spin_lock(&mm->page_table_lock);
+ pte = huge_pte_alloc(mm, addr & HPAGE_MASK);
+ if (!pte)
+ goto oom;
+ if (!pte_none(*pte))
+ goto out;
+ spin_unlock(&mm->page_table_lock);
+
+ mapping = vma->vm_file->f_dentry->d_inode->i_mapping;
+ idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
+ + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+retry:
+ page = find_get_page(mapping, idx);
+ if (!page) {
+ page = alloc_huge_page();
+ if (!page)
+ /*
+ * with strict overcommit accounting, we should never
+ * run out of hugetlb page, so must be a fault race
+ * and let's retry.
+ */
+ goto retry;
+ ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+ if (!ret) {
+ unlock_page(page);
+ } else {
+ put_page(page);
+ if (ret == -EEXIST)
+ goto retry;
+ else
+ return VM_FAULT_OOM;
+ }
+ }
+
+ spin_lock(&mm->page_table_lock);
+ if (pte_none(*pte)) {
+ set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+ huge_update_mmu_cache(vma, addr, *pte);
+ } else
+ put_page(page);
+out:
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_MINOR;
+oom:
+ spin_unlock(&mm->page_table_lock);
+ return VM_FAULT_OOM;
+}
+
/*
- * We cannot handle pagefaults against hugetlb pages at all. They cause
- * handle_mm_fault() to try to instantiate regular-sized pages in the
- * hugegpage VMA. do_page_fault() is supposed to trap this, so BUG is we get
- * this far.
+ * We should not get here because handle_mm_fault() is supposed to trap
+ * hugetlb page faults. BUG if we get here.
*/
static struct page *hugetlb_nopage(struct vm_area_struct *vma,
unsigned long address, int *unused)
Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/mm/memory.c 2004-10-21 14:50:14.000000000 -0700
@@ -765,11 +765,6 @@
|| !(flags & vma->vm_flags))
return i ? : -EFAULT;

- if (is_vm_hugetlb_page(vma)) {
- i = follow_hugetlb_page(mm, vma, pages, vmas,
- &start, &len, i);
- continue;
- }
spin_lock(&mm->page_table_lock);
do {
struct page *map;
@@ -1693,7 +1688,7 @@
inc_page_state(pgfault);

if (is_vm_hugetlb_page(vma))
- return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ return handle_hugetlb_mm_fault(mm, vma, address, write_access);

/*
* We need the page table lock to synchronize with kswapd

2004-10-26 02:04:03

by Christoph Lameter

Subject: Hugepages demand paging V2 [7/8]: sh64 arch modifications

Changelog
* Provide huge_update_mmu_cache by mapping it to update_mmu_cache (which just
counts the number of calls)
* Extend flush_dcache_page to handle compound pages
* Not built and not tested


Index: linux-2.6.9/include/asm-sh64/pgtable.h
===================================================================
--- linux-2.6.9.orig/include/asm-sh64/pgtable.h 2004-10-21 12:01:24.000000000 -0700
+++ linux-2.6.9/include/asm-sh64/pgtable.h 2004-10-25 14:55:29.000000000 -0700
@@ -462,6 +462,7 @@

extern void update_mmu_cache(struct vm_area_struct * vma,
unsigned long address, pte_t pte);
+#define huge_update_mmu_cache update_mmu_cache

/* Encode and decode a swap entry */
#define __swp_type(x) (((x).val & 3) + (((x).val >> 1) & 0x3c))
Index: linux-2.6.9/arch/sh64/mm/cache.c
===================================================================
--- linux-2.6.9.orig/arch/sh64/mm/cache.c 2004-10-25 15:02:58.000000000 -0700
+++ linux-2.6.9/arch/sh64/mm/cache.c 2004-10-25 15:03:16.000000000 -0700
@@ -990,7 +990,16 @@

void flush_dcache_page(struct page *page)
{
- sh64_dcache_purge_phy_page(page_to_phys(page));
+ if (likely(!PageCompound(page)))
+ sh64_dcache_purge_phy_page(page_to_phys(page));
+ else {
+ int nr;
+
+ page = (struct page *)page->private;
+ nr = 1 << page[1].index;
+ while (nr--)
+ sh64_dcache_purge_phy_page(page_to_phys(page++));
+ }
wmb();
}


2004-10-26 02:25:15

by William Lee Irwin III

Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
> - memory policy for numa alloc is only available in mempolicy.c and
> not in hugetlb.c If hugepage allocation needs to follow mempolicy
> then we need additional stuff in mempolicy.c exported (defer for now).

Exported? hugetlb hasn't ever really successfully been made modular.


On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
> - Do other arch specific functions need to be aware of compound pages for
> this to work?

Not sure where any new dependencies would come in, or even what "this"
means in your question.


On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
> - Clearing hugetlb pages is time consuming using clear_highpage in
> alloc_huge_page. Make it possible to use hw assist via DMA or so there?

It's possible, but it's been found not to be useful. What has been found
useful is assistance from much lower-level memory hardware of a kind
not to be had in any extant mass-manufactured machines.


On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
> - sparc64 arch code needs to be tested
> - sh64 code needs to be fixed up and tested
> - sh code needs to be fixed up and tested

I'll ask the maintainers where to get sh and sh64 hardware.


-- wli

2004-10-26 14:35:58

by Robin Holt

Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Mon, Oct 25, 2004 at 07:40:30PM -0700, Jesse Barnes wrote:
> On Monday, October 25, 2004 7:23 pm, William Lee Irwin III wrote:
> > On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
> > > - Clearing hugetlb pages is time consuming using clear_highpage in
> > > alloc_huge_page. Make it possible to use hw assist via DMA or so there?
> >
> > It's possible, but it's been found not to be useful. What has been found
> > useful is assistance from much lower-level memory hardware of a kind
> > not to be had in any extant mass-manufactured machines.
>
> Do you have examples? SGI hardware has a so-called 'BTE' (for Block Transfer
> Engine) that can arbitrarily zero or copy pages w/o CPU assistance. It's
> builtin to the memory controller. Using it to zero the pages has the
> advantages of being asyncrhonous and not hosing the CPU cache.
>

Jesse,

Sorry for being a stickler here, but the BTE is really part of the
I/O interface portion of the shub. That portion has a separate clock
frequency from the memory controller (unfortunately slower). The BTE
can zero pages at a slightly slower speed than the processor. It does
not, as you pointed out, trash the CPU cache.

One other feature of the BTE is that it can operate asynchronously from
the CPU. This could be used to schedule additional huge page zero filling
on multiple nodes at the same time, for example during a clock interrupt.
This could result in a huge speed boost on machines that have multiple
memory-only nodes. That has not been tested thoroughly. We have done
considerable testing of the page zero functionality as well as the
error handling.

Robin

2004-10-26 16:45:35

by Jesse Barnes

Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Tuesday, October 26, 2004 7:35 am, Robin Holt wrote:
> Sorry for being a stickler here, but the BTE is really part of the
> I/O Interface portion of the shub. That portion has a seperate clock
> frequency from the memory controller (unfortunately slower). The BTE
> can zero at a slightly slower speed than the processor. It does, as
> you pointed out, not trash the CPU cache.

I guess I was getting ahead of myself :). I knew that it was part of the II
but didn't know it had a slower clock frequency than the MD.

> One other feature of the BTE is it can operate asynchronously from
> the cpu. This could be used to, during a clock interrupt, schedule
> additional huge page zero filling on multiple nodes at the same time.
> This could result in a huge speed boost on machines that have multiple
> memory only nodes. That has not been tested thoroughly. We have done
> considerable testing of the page zero functionality as well as the
> error handling.

Might be worth some additional testing...

Jesse

2004-10-26 17:41:17

by William Lee Irwin III

Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Tuesday, October 26, 2004 7:35 am, Robin Holt wrote:
>> One other feature of the BTE is it can operate asynchronously from
>> the cpu. This could be used to, during a clock interrupt, schedule
>> additional huge page zero filling on multiple nodes at the same time.
>> This could result in a huge speed boost on machines that have multiple
>> memory only nodes. That has not been tested thoroughly. We have done
>> considerable testing of the page zero functionality as well as the
>> error handling.

On Tue, Oct 26, 2004 at 09:44:21AM -0700, Jesse Barnes wrote:
> Might be worth some additional testing...

And an architecture method for hugepage clearing.


-- wli

2004-10-26 17:48:11

by William Lee Irwin III

Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Tue, 26 Oct 2004, William Lee Irwin III wrote:
>> And an architecture method for hugepage clearing.

On Tue, Oct 26, 2004 at 10:45:55AM -0700, Christoph Lameter wrote:
> Add clear_huge_page to asm-generic/pgtable.h and an associated
> __HAVE_ARCH_CLEAR_HUGE_PAGE ?

Or a weak function.


-- wli

2004-10-26 17:49:58

by Christoph Lameter

Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Tue, 26 Oct 2004, William Lee Irwin III wrote:

> And an architecture method for hugepage clearing.

Add clear_huge_page to asm-generic/pgtable.h and an associated
__HAVE_ARCH_CLEAR_HUGE_PAGE ?
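
For illustration, the generic fallback could look roughly like this (a sketch
only; neither the macro nor a generic clear_huge_page() exists in 2.6.9, and an
architecture such as sn2 would define __HAVE_ARCH_CLEAR_HUGE_PAGE and supply a
BTE-backed version instead):

	/* include/asm-generic/pgtable.h (sketch) */
	#ifndef __HAVE_ARCH_CLEAR_HUGE_PAGE
	static inline void clear_huge_page(struct page *page)
	{
		int i;

		/* Generic fallback: clear one base page at a time. */
		for (i = 0; i < HPAGE_SIZE / PAGE_SIZE; i++)
			clear_highpage(page + i);
	}
	#endif

The weak-function alternative mentioned above would instead make this a
__attribute__((weak)) default in mm/hugetlb.c, as the patchset already does for
handle_hugetlb_mm_fault(), and let an architecture provide a strong override.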


2004-10-27 05:58:13

by David Gibson

Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
> Changes from V1:
> - support huge pages in flush_dcache_page on various architectures
> - revised simple numa allocation
> - do not include update_mmu_cache in set_huge_pte. Require huge_update_mmu_cache
>
> This is a revised edition of the hugetlb demand page patches by
> Kenneth Chen which were discussed in the following thread in August 2004
>
> http://marc.theaimsgroup.com/?t=109171285000004&r=1&w=2
>
> The initial post by Ken was in April in
>
> http://marc.theaimsgroup.com/?l=linux-ia64&m=108189860401704&w=2
>
> Hugetlb demand paging has been part of SuSE SLES 9 for awhile now
> and this patchset is intended to help hugetlb demand paging also get
> into the official Linux kernel. Huge pages are referred to as
> "compound" pages in terms of "struct page" in the Linux kernel. The
> term "compund page" may be used alternatively to huge page.

I wish we could start calling this "lazy allocation" instead of
"demand paging". "Demand paging" makes people think of swapping
hugepages, or mapping files on real filesystems with hugepages, which
is not what these patches do, and probably something we don't want to
do.

--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson

2004-10-27 06:53:17

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
> Hugetlb demand paging has been part of SuSE SLES 9 for awhile now and
> this patchset is intended to help hugetlb demand paging also get into
> the official Linux kernel. Huge pages are referred to as "compound"
> pages in terms of "struct page" in the Linux kernel. The term
"compund page" may be used alternatively to huge page.

This may very well explain why SLES9 is triplefaulting when Oracle
tries to use hugetlb on it on x86-64.

Since all this is clearly malfunctioning and not done anywhere near
carefully enough, can I at least get *some* sanction to do any of this
differently?


-- wli

2004-10-27 14:15:37

by Ray Bryant

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

William Lee Irwin III wrote:
> On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
>
>>Hugetlb demand paging has been part of SuSE SLES 9 for awhile now and
>>this patchset is intended to help hugetlb demand paging also get into
>>the official Linux kernel. Huge pages are referred to as "compound"
>>pages in terms of "struct page" in the Linux kernel. The term
>>"compund page" may be used alternatively to huge page.
>
> This may very well explain why SLES9 is triplefaulting when Oracle
> tries to use hugetlb on it on x86-64.
>
> Since all this is clearly malfunctioning and not done anywhere near
> carefully enough, can I at least get *some* sanction to do any of this
> differently?
>
>
> -- wli

How differently? What do you have in mind?

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-10-27 16:37:32

by Christoph Lameter

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Tue, 26 Oct 2004, William Lee Irwin III wrote:

> On Mon, Oct 25, 2004 at 06:26:42PM -0700, Christoph Lameter wrote:
> > Hugetlb demand paging has been part of SuSE SLES 9 for awhile now and
> > this patchset is intended to help hugetlb demand paging also get into
> > the official Linux kernel. Huge pages are referred to as "compound"
> > pages in terms of "struct page" in the Linux kernel. The term
> "compund page" may be used alternatively to huge page.
>
> This may very well explain why SLES9 is triplefaulting when Oracle
> tries to use hugetlb on it on x86-64.
>
> Since all this is clearly malfunctioning and not done anywhere near
> carefully enough, can I at least get *some* sanction to do any of this
> differently?

The current SUSE implementation is a different one and has severe
limitations. They need a new implementation, and the suggestion was made
to start with Ken's patches.
What would you like to do differently?

2004-10-27 17:12:31

by Christoph Lameter

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Wed, 27 Oct 2004, David Gibson wrote:

> I wish we could start calling this "lazy allocation" instead of
> "demand paging". "Demand paging" makes people think of swapping
> hugepages, or mapping files on real filesystems with hugepages, which
> is not what these patches do, and probably something we don't want to
> do.

Good idea.

2004-10-27 18:13:44

by Christoph Lameter

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Tue, 26 Oct 2004, Robin Holt wrote:

> Sorry for being a stickler here, but the BTE is really part of the
> I/O Interface portion of the shub. That portion has a separate clock
> frequency from the memory controller (unfortunately slower). The BTE
> can zero at a slightly slower speed than the processor. It does, as
> you pointed out, not trash the CPU cache.
>
> One other feature of the BTE is it can operate asynchronously from
> the cpu. This could be used to, during a clock interrupt, schedule
> additional huge page zero filling on multiple nodes at the same time.
> This could result in a huge speed boost on machines that have multiple
> memory only nodes. That has not been tested thoroughly. We have done
> considerable testing of the page zero functionality as well as the
> error handling.

If the hugetlb patch supported some way of redirecting the clearing of a
huge page, then we could:

1. set the huge pte to not present so that we get a fault on access
2. run the BTE clearer
3. on receiving a huge fault, check whether the BTE has finished

This would parallelize the clearing of huge pages. But is that really more
efficient? There may be complexity involved in allowing multiple pages to
be cleared at once, and tracking the clears in progress is additional
overhead.
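To make the idea concrete, roughly (a sketch only: every bte_* helper and
the use of page->private as a completion handle are hypothetical, and
set_huge_pte() is used loosely here):

	/* 1. when the huge page is handed out: queue an asynchronous BTE
	 *    zero and install the huge pte as not-present so that the
	 *    first touch faults */
	page->private = bte_zero_async(page_to_phys(page), HPAGE_SIZE);
	set_huge_pte(mm, vma, page, /* not present */ 0);

	/* 2./3. in the huge fault handler: before making the pte present,
	 *    wait for the queued BTE transfer, falling back to the CPU if
	 *    the transfer failed */
	while (!bte_zero_done(page->private))
		cpu_relax();
	if (bte_zero_error(page->private))
		for (i = 0; i < HPAGE_SIZE / PAGE_SIZE; i++)
			clear_highpage(page + i);
	set_huge_pte(mm, vma, page, /* present, writable */ 1);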


2004-10-27 23:01:50

by Ray Bryant

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

Christoph Lameter wrote:
> On Tue, 26 Oct 2004, Robin Holt wrote:
>
>
>>Sorry for being a stickler here, but the BTE is really part of the
>>I/O Interface portion of the shub. That portion has a separate clock
>>frequency from the memory controller (unfortunately slower). The BTE
>>can zero at a slightly slower speed than the processor. It does, as
>>you pointed out, not trash the CPU cache.
>>
>>One other feature of the BTE is it can operate asynchronously from
>>the cpu. This could be used to, during a clock interrupt, schedule
>>additional huge page zero filling on multiple nodes at the same time.
>>This could result in a huge speed boost on machines that have multiple
>>memory only nodes. That has not been tested thoroughly. We have done
>>considerable testing of the page zero functionality as well as the
>>error handling.
>
>
> If the huge patch would support some way of redirecting the clearing of a
> huge page then we could:
>
> 1. set the huge pte to not present so that we get a fault on access
> 2. run the bte clearer.
> 3. On receiving a huge fault we could check for the bte being finished.
>
> This would parallelize the clearing of huge pages. But is that really more
> efficient? There may be complexity involved in allowing the clearing of
> multiple pages and tracking of the clear in progress is additional
> overhead.
>
>

I'm personally of the opinion that using the BTE to "speculatively" clear
hugetlb pages in advance of when they are requested is not a good thing [tm].
One never knows if those pages will ever be requested. And in the meantime,
tasks that need the BTE will be delayed by the speculative use.
But that is a personal bias :-), with no data to back it up.

AFAIK, it is faster to clear the page with the processor anyway, since the
processor has a faster clock cycle. Yes, it destroys the processor cache,
but the application has clearly indicated that it wants the page NOW, please
(because it has faulted on it), and delivering the page to the application
as quickly as possible sounds like a good thing. I'm not sure reloading
the processor cache at this point is a cost we care about, given that the
application is likely just starting up anyway. I figure hugetlb pages are
allocated once and stay around a long, long time, so I'm not sure optimizing
to minimize cache damage is the correct way to go here.

The only obvious win is for memory-only nodes, which have a BTE and no CPU.
It is probably faster to use the local BTE than a remote CPU to clear the page.

Does that make any sense?

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-10-27 23:07:29

by Ray Bryant

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

Christoph Lameter wrote:
> On Tue, 26 Oct 2004, Robin Holt wrote:
>
>
>>Sorry for being a stickler here, but the BTE is really part of the
>>I/O Interface portion of the shub. That portion has a separate clock
>>frequency from the memory controller (unfortunately slower). The BTE
>>can zero at a slightly slower speed than the processor. It does, as
>>you pointed out, not trash the CPU cache.
>>
>>One other feature of the BTE is it can operate asynchronously from
>>the cpu. This could be used to, during a clock interrupt, schedule
>>additional huge page zero filling on multiple nodes at the same time.
>>This could result in a huge speed boost on machines that have multiple
>>memory only nodes. That has not been tested thoroughly. We have done
>>considerable testing of the page zero functionality as well as the
>>error handling.
>
>
> If the huge patch would support some way of redirecting the clearing of a
> huge page then we could:
>
> 1. set the huge pte to not present so that we get a fault on access
> 2. run the bte clearer.
> 3. On receiving a huge fault we could check for the bte being finished.
>
> This would parallelize the clearing of huge pages. But is that really more
> efficient? There may be complexity involved in allowing the clearing of
> multiple pages and tracking of the clear in progress is additional
> overhead.
>
>
Another point is that if you zero pages offline from the application,
then depending on how many pre-zeroed pages an application finds, its
execution time may not be repeatable. I.e. if the system has been idle
a long time, then the BTEs will have zeroed all of the allocated hugetlb
pages and startup will be fast. But then the next time the application runs,
those pages it used and released will have to be zeroed before it can run.
I think I would vote for repeatability here rather than using the BTE
offline to zero pages.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-10-28 11:59:36

by Robin Holt

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

On Wed, Oct 27, 2004 at 06:01:12PM -0500, Ray Bryant wrote:
> Christoph Lameter wrote:
> >On Tue, 26 Oct 2004, Robin Holt wrote:
> >
> >
> >>Sorry for being a stickler here, but the BTE is really part of the
> >>I/O Interface portion of the shub. That portion has a separate clock
> >>frequency from the memory controller (unfortunately slower). The BTE
> >>can zero at a slightly slower speed than the processor. It does, as
> >>you pointed out, not trash the CPU cache.
> >>
> >>One other feature of the BTE is it can operate asynchronously from
> >>the cpu. This could be used to, during a clock interrupt, schedule
> >>additional huge page zero filling on multiple nodes at the same time.
> >>This could result in a huge speed boost on machines that have multiple
> >>memory only nodes. That has not been tested thoroughly. We have done
> >>considerable testing of the page zero functionality as well as the
> >>error handling.
> >
> >
> >If the huge patch would support some way of redirecting the clearing of a
> >huge page then we could:
> >
> >1. set the huge pte to not present so that we get a fault on access
> >2. run the bte clearer.
> >3. On receiving a huge fault we could check for the bte being finished.
> >
> >This would parallelize the clearing of huge pages. But is that really more
> >efficient? There may be complexity involved in allowing the clearing of
> >multiple pages and tracking of the clear in progress is additional
> >overhead.
> >
> >
>
> I'm personally of the opinion that using the BTE to "speculatively" clear
> hugetlb pages in advance of when the hugetlb pages are requested is not a
> good
> thing [tm]. One never knows if those pages will ever be requested. And in
> the meantime, tasks that need the BTE will be delayed by speculative use.
> But that is a personal bias :-), with no data to back it up.

I was thinking the BTE would be best used in an async mode where the pages
would be pre-zeroed and available for use if the application needs them.
If the pre-zeroed list is empty, then use the CPU to zero the page.
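In allocator terms that fallback would look roughly like this (a sketch
only: the zeroed_hugepage_freelists array is made up, locking is shown
only schematically, and empty-list handling is omitted):

	spin_lock(&hugetlb_lock);
	if (!list_empty(&zeroed_hugepage_freelists[nid])) {
		/* hypothetical per-node list the BTE refills in the background */
		page = list_entry(zeroed_hugepage_freelists[nid].next,
					struct page, lru);
		list_del(&page->lru);
		spin_unlock(&hugetlb_lock);
	} else {
		/* nothing pre-zeroed: take an ordinary huge page and let the
		 * CPU clear it on the spot, as the current code does */
		page = list_entry(hugepage_freelists[nid].next,
					struct page, lru);
		list_del(&page->lru);
		spin_unlock(&hugetlb_lock);
		for (i = 0; i < HPAGE_SIZE / PAGE_SIZE; i++)
			clear_highpage(page + i);
	}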

>
> AFAIK, it is faster to clear the page with the processor anyway, since the

The processor is slightly faster. I believe the FSB is 200MHz and the
II is 100MHz (150MHz with no attached IX brick). Future versions of the
BTE will possibly have faster access to on-node memory than the processor.

> processor has a faster clock cycle. Yes, it destroys the processor cache,
> but the application has clearly indicated that it wants the page NOW,
> please,
> (because it has faulted on it), and delivering the page to the application
> as quickly as possible sounds like a good thing. I'm not sure reloading

I am not either. I just would like to see any design take into consideration
the possible uses and not design them out. Nothing more.

> the processor cache at this point is a cost we care about, given that the
> application is likely just starting up anyway. I figure hugetlb pages are
> allocated once, stay around a long long time, so I'm not sure optimizing to
> minimize cache damage is the correct way to go here.
>
> The only obvious win is for memory only nodes, that have a BTE and no CPU.
> It is probably faster to use the local BTE than a remote CPU to clear the
> page.

Plus, a single CPU could schedule the clearing of pages on multiple
nodes at the same time. Imagine a system that has 256 compute nodes
and 756 memory nodes. That configuration is theoretically possible with
today's hardware, but we have never built or sold one. Looking at that
configuration gives you one possible indication of how a pre-zeroing
mechanism might improve things.

I am not saying that the BTE is the best option, or even a good one. It
just looks interesting. It does bring up some interesting problems with
repeatability. Consider the application startup following termination
of another which used all the huge pages. The pre-zeroed list will
be nearly if not completely empty. The first fault will find the list
empty and have to zero the page itself. Hopefully, the second fault will
find one on the zeroed list and return immediately. This would cause
application startup time to feel like it doubled from the previous run.
Ouch. That would be very upsetting for our typical customers.

The more memory nodes you have per cpu, the better this number will
appear.

Sorry for being spineless, but I don't feel very strongly that it will
be beneficial enough to be desirable. I am just not sure. I would
just hope that it is taken into consideration during the design and,
as long as it has no negative impact on the design, left open as a
possibility.

Thanks,
Robin Holt

2004-10-28 16:29:04

by Ray Bryant

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview

Robin Holt wrote:
> On Wed, Oct 27, 2004 at 06:01:12PM -0500, Ray Bryant wrote:
>
>>Christoph Lameter wrote:
>>
>>>On Tue, 26 Oct 2004, Robin Holt wrote:
>>>
>>>
>>>
>>>>Sorry for being a stickler here, but the BTE is really part of the
>>>>I/O Interface portion of the shub. That portion has a separate clock
>>>>frequency from the memory controller (unfortunately slower). The BTE
>>>>can zero at a slightly slower speed than the processor. It does, as
>>>>you pointed out, not trash the CPU cache.
>>>>
>>>>One other feature of the BTE is it can operate asynchronously from
>>>>the cpu. This could be used to, during a clock interrupt, schedule
>>>>additional huge page zero filling on multiple nodes at the same time.
>>>>This could result in a huge speed boost on machines that have multiple
>>>>memory only nodes. That has not been tested thoroughly. We have done
>>>>considerable testing of the page zero functionality as well as the
>>>>error handling.
>>>
>>>
>>>If the huge patch would support some way of redirecting the clearing of a
>>>huge page then we could:
>>>
>>>1. set the huge pte to not present so that we get a fault on access
>>>2. run the bte clearer.
>>>3. On receiving a huge fault we could check for the bte being finished.
>>>
>>>This would parallelize the clearing of huge pages. But is that really more
>>>efficient? There may be complexity involved in allowing the clearing of
>>>multiple pages and tracking of the clear in progress is additional
>>>overhead.
>>>
>>>
>>
>>I'm personally of the opinion that using the BTE to "speculatively" clear
>>hugetlb pages in advance of when the hugetlb pages are requested is not a
>>good
>>thing [tm]. One never knows if those pages will ever be requested. And in
>>the meantime, tasks that need the BTE will be delayed by speculative use.
>>But that is a personal bias :-), with no data to back it up.
>
>
> I was thinking the bte would be best used in an async mode where the pages
> would be pre-zeroed and available for use if the application needs them.
> If the pre-zeroed list is empty, then use the cpu to zero the page.
>
>
>>AFAIK, it is faster to clear the page with the processor anyway, since the
>
>
> The processor is slightly faster. I believe the FSB is 200Mhz and the
> II is 100Mhz (150Mhz with no attached IX brick). Future versions of the
> BTE will possibly have faster access to on node memory than the processor.
>
>
>>processor has a faster clock cycle. Yes, it destroys the processor cache,
>>but the application has clearly indicated that it wants the page NOW,
>>please,
>>(because it has faulted on it), and delivering the page to the application
>>as quickly as possible sounds like a good thing. I'm not sure reloading
>
>
> I am not either. I just would like to see any design take into consideration
> the possible uses and not design them out. Nothing more.
>
>
>>the processor cache at this point is a cost we care about, given that the
>>application is likely just starting up anyway. I figure hugetlb pages are
>>allocated once, stay around a long long time, so I'm not sure optimizing to
>>minimize cache damage is the correct way to go here.
>>
>>The only obvious win is for memory only nodes, that have a BTE and no CPU.
>>It is probably faster to use the local BTE than a remote CPU to clear the
>>page.
>
>
> Plus, a single CPU could schedule the clearing of pages on multiple
> nodes at the same time. Imagine a system that has 256 compute nodes
> and 756 memory nodes. That configuration is theoretically possible with
> today's hardware, but we have never built or sold one. Looking at that
> configuration gives you one possible indication of how a pre-zeroing
> mechanism might improve things.
>
> I am not saying that the BTE is the best option, or even a good one. It
> just looks interesting. It does bring up some interesting problems with
> repeatability. Consider the application startup following termination
> of another which used all the huge pages. The pre-zeroed list will
> be nearly if not completely empty. The first fault will find the list
> empty and have to zero the page itself. Hopefully, the second fault will
> find one on the zeroed list and return immediately. This would cause
> application startup time to feel like it doubled from the previous run.
> Ouch. That would be very upsetting for our typical customers.
>

Yep.

> The more memory nodes you have per cpu, the better this number will
> appear.
>
> Sorry for being spineless, but I don't feel very strongly that it will
> be beneficial enough to be desirable. I am just not sure. I would
> just hope that it is taken into consideration during the design and,
> as long as it has no negative impact on the design, be left as a
> possibility.
>
> Thanks,
> Robin Holt
>

As always, Robin, you are being very reasonable. I think the option
should be kept open as you suggest, since it may help and I agree it
is an interesting approach that might yield big speedups.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2005-01-18 12:26:51

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [1/8]: hugetlb fault handler

Hi,

> ChangeLog
> * provide huge page fault handler and related things

<snip>

> Index: linux-2.6.9/fs/hugetlbfs/inode.c
> ===================================================================
> --- linux-2.6.9.orig/fs/hugetlbfs/inode.c 2004-10-18 14:55:07.000000000 -0700
> +++ linux-2.6.9/fs/hugetlbfs/inode.c 2004-10-21 14:50:14.000000000 -0700
> @@ -79,10 +79,6 @@
> if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
> goto out;
>
> - ret = hugetlb_prefault(mapping, vma);
> - if (ret)
> - goto out;
> -
> if (inode->i_size < len)
> inode->i_size = len;
> out:

hugetlbfs_file_mmap() may fail with a spurious error, as it can return
the uninitialized variable "ret".
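For reference, the straightforward way to avoid that, with the prefault
call gone, would be to set ret explicitly on the success path (a sketch
only, not the fix that was eventually used):

	if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
		goto out;

	/* the success path no longer runs hugetlb_prefault(), so ret must
	 * be set before falling through to out: */
	ret = 0;
	if (inode->i_size < len)
		inode->i_size = len;
out:
	up(&inode->i_sem);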


Thanks.
Hirokazu Takahashi.

2005-01-18 16:34:04

by Christoph Lameter

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [1/8]: hugetlb fault handler

On Tue, 18 Jan 2005, Hirokazu Takahashi wrote:

> > Index: linux-2.6.9/fs/hugetlbfs/inode.c
> > ===================================================================
> > --- linux-2.6.9.orig/fs/hugetlbfs/inode.c 2004-10-18 14:55:07.000000000 -0700
> > +++ linux-2.6.9/fs/hugetlbfs/inode.c 2004-10-21 14:50:14.000000000 -0700
> > @@ -79,10 +79,6 @@
> > if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
> > goto out;
> >
> > - ret = hugetlb_prefault(mapping, vma);
> > - if (ret)
> > - goto out;
> > -
> > if (inode->i_size < len)
> > inode->i_size = len;
> > out:
>
> hugetlbfs_file_mmap() may fail with a weird error, as it returns
> uninitialized variable "ret".

Hmm. The current diff is:

@@ -79,11 +278,10 @@ static int hugetlbfs_file_mmap(struct fi
if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
goto out;

- ret = hugetlb_prefault(mapping, vma);
- if (ret)
- goto out;
+ ret = hugetlb_acct_commit(inode, VMACCTPG(vma->vm_pgoff),
+ VMACCTPG(vma->vm_pgoff + (vma_len >> PAGE_SHIFT)));

- if (inode->i_size < len)
+ if (ret >= 0 && inode->i_size < len)
inode->i_size = len;
out:
up(&inode->i_sem);

which does not leave ret uninitialized. Also, this whole hugetlb
stuff has not been finalized yet and is not that high on my list of things
to do.

2005-02-02 12:29:19

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Hugepages demand paging V2 [3/8]: simple numa compatible allocator

Hi Christoph,


> Changelog
> * Simple NUMA compatible allocation of hugepages in the nearest node
>
> Index: linux-2.6.9/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.9.orig/mm/hugetlb.c 2004-10-22 13:28:27.000000000 -0700
> +++ linux-2.6.9/mm/hugetlb.c 2004-10-25 16:56:22.000000000 -0700
> @@ -32,14 +32,17 @@
> {
> int nid = numa_node_id();
> struct page *page = NULL;
> -
> - if (list_empty(&hugepage_freelists[nid])) {
> - for (nid = 0; nid < MAX_NUMNODES; ++nid)
> - if (!list_empty(&hugepage_freelists[nid]))
> - break;
> + struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists;


I think the previous line should be replaced with

struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists + __GFP_HIGHMEM;

because NODE_DATA(nid)->node_zonelists by itself is the zonelist for
__GFP_DMA zones. __GFP_HIGHMEM would be suitable for hugetlb pages
(a sketch with this change applied follows after the quoted hunk below).


> + struct zone **zones = zonelist->zones;
> + struct zone *z;
> + int i;
> +
> + for(i=0; (z = zones[i])!= NULL; i++) {
> + nid = z->zone_pgdat->node_id;
> + if (!list_empty(&hugepage_freelists[nid]))
> + break;
> }
> - if (nid >= 0 && nid < MAX_NUMNODES &&
> - !list_empty(&hugepage_freelists[nid])) {
> + if (z) {
> page = list_entry(hugepage_freelists[nid].next,
> struct page, lru);
> list_del(&page->lru);
>
> -
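Putting that suggestion together with the hunk above, the start of the
allocator would then read roughly as follows (a sketch only; the rest of
the function is unchanged from the patch):

	int nid = numa_node_id();
	struct page *page = NULL;
	/* index node_zonelists with the highmem modifier instead of taking
	 * the first (__GFP_DMA) zonelist */
	struct zonelist *zonelist = NODE_DATA(nid)->node_zonelists +
						__GFP_HIGHMEM;
	struct zone **zones = zonelist->zones;
	struct zone *z;
	int i;

	for (i = 0; (z = zones[i]) != NULL; i++) {
		nid = z->zone_pgdat->node_id;
		if (!list_empty(&hugepage_freelists[nid]))
			break;
	}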

Thanks,
Hirokazu Takahashi.