Hi Andrew. Can we give hugetlb demand faulting a spin in the mm tree?
And could people with alpha, sparc, and ia64 machines give them a good
spin? I haven't been able to test those arches yet.
-Thanks
- htlb-get_user_pages removes an optimization that is no longer valid
when demand faulting huge pages
- htlb-fault moves the fault logic from hugetlb_prefault() to
hugetlb_pte_fault() and find_get_huge_page().
- htlb-acct adds an overcommit check to maintain the no-overcommit
semantics provided by hugetlb_prefault()
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
Initial Post (Thu, 18 Aug 2005)
In preparation for hugetlb demand faulting, remove this get_user_pages()
optimization. Since huge pages will no longer be prefaulted, we can't assume
that the huge ptes are established, and hence calling follow_hugetlb_page() is
no longer valid.
With the follow_hugetlb_page() call removed, the normal code path will be
triggered. follow_page() will either use follow_huge_addr() or
follow_huge_pmd() to check for a previously faulted "page" to return. When
this fails (i.e. with demand faults), __handle_mm_fault() gets called, which
invokes the hugetlb_fault() handler to instantiate the huge page.
This patch doesn't make a lot of sense by itself, but I've broken it out to
facilitate discussion on this specific element of the demand fault changes.
While coding this up, I referenced previous discussion on this topic starting
at http://lkml.org/lkml/2004/4/13/176 , which contains more opinions about the
correctness of this approach.
Diffed against 2.6.14-rc2-git6
Signed-off-by: Adam Litke <[email protected]>
---
memory.c | 5 -----
1 files changed, 5 deletions(-)
diff -upN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c
+++ current/mm/memory.c
@@ -949,11 +949,6 @@ int get_user_pages(struct task_struct *t
|| !(flags & vma->vm_flags))
return i ? : -EFAULT;
- if (is_vm_hugetlb_page(vma)) {
- i = follow_hugetlb_page(mm, vma, pages, vmas,
- &start, &len, i);
- continue;
- }
spin_lock(&mm->page_table_lock);
do {
int write_access = write;
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
Version 3 (Thu, 08 Sep 2005)
Organized logic in hugetlb_pte_fault() by breaking out the
find_get_page/alloc_huge_page logic into a separate function
Removed a few more paranoid checks
Fixed tlb flushing in a race case (thanks Yanmin Zhang)
Version 2 (Wed, 17 Aug 2005)
Removed spurious WARN_ON()
Patches added earlier in the series:
Check for p?d_none() in arch/i386/mm/hugetlbpage.c:huge_pte_offset()
Move i386 stale pte check into huge_pte_alloc()
Initial Post (Fri, 05 Aug 2005)
Below is a patch to implement demand faulting for huge pages. The main
motivation for changing from prefaulting to demand faulting is so that
huge page memory areas can be allocated according to NUMA policy.
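As an illustration, with demand faulting a hypothetical userspace program like
the one below (assuming a hugetlbfs mount at /mnt/huge, 2MB huge pages, and the
libnuma mbind() wrapper from <numaif.h>; link with -lnuma) only has its huge
pages allocated at first touch, so placement can be decided at fault time -- by
first-touch locality or, eventually, by the VMA's memory policy -- instead of
being fixed at mmap() time:

	#include <fcntl.h>
	#include <numaif.h>		/* mbind(), MPOL_PREFERRED */
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed huge page size */

	int main(void)
	{
		size_t len = 4 * HPAGE_SIZE;
		unsigned long nodemask = 1UL << 0;	/* prefer node 0 */
		void *p;
		int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* set the policy after mmap() but before the pages exist */
		if (mbind(p, len, MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8, 0))
			perror("mbind");

		memset(p, 0, len);	/* huge pages are faulted in (and placed) here */

		munmap(p, len);
		close(fd);
		return 0;
	}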
Thanks to consolidated hugetlb code, switching the behavior requires changing
only one fault handler. The bulk of the patch just moves the logic from
hugetlb_prefault() to hugetlb_pte_fault() and find_get_huge_page().
Diffed against 2.6.14-rc2-git6
Signed-off-by: Adam Litke <[email protected]>
---
fs/hugetlbfs/inode.c | 6 -
include/linux/hugetlb.h | 2
mm/hugetlb.c | 154 +++++++++++++++++++++++++++++-------------------
mm/memory.c | 2
4 files changed, 98 insertions(+), 66 deletions(-)
diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c
+++ current/fs/hugetlbfs/inode.c
@@ -48,7 +48,6 @@ int sysctl_hugetlb_shm_group;
static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_dentry->d_inode;
- struct address_space *mapping = inode->i_mapping;
loff_t len, vma_len;
int ret;
@@ -79,10 +78,7 @@ static int hugetlbfs_file_mmap(struct fi
if (!(vma->vm_flags & VM_WRITE) && len > inode->i_size)
goto out;
- ret = hugetlb_prefault(mapping, vma);
- if (ret)
- goto out;
-
+ ret = 0;
if (inode->i_size < len)
inode->i_size = len;
out:
diff -upN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h
--- reference/include/linux/hugetlb.h
+++ current/include/linux/hugetlb.h
@@ -25,6 +25,8 @@ int is_hugepage_mem_enough(size_t);
unsigned long hugetlb_total_pages(void);
struct page *alloc_huge_page(void);
void free_huge_page(struct page *);
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct * vma,
+ unsigned long address, int write_access);
extern unsigned long max_huge_pages;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
diff -upN reference/mm/hugetlb.c current/mm/hugetlb.c
--- reference/mm/hugetlb.c
+++ current/mm/hugetlb.c
@@ -274,21 +274,22 @@ int copy_hugetlb_page_range(struct mm_st
{
pte_t *src_pte, *dst_pte, entry;
struct page *ptepage;
- unsigned long addr = vma->vm_start;
+ unsigned long addr;
unsigned long end = vma->vm_end;
- while (addr < end) {
+ for (addr = vma->vm_start; addr < end; addr += HPAGE_SIZE) {
+ src_pte = huge_pte_offset(src, addr);
+ if (!src_pte || pte_none(*src_pte))
+ continue;
+
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
- src_pte = huge_pte_offset(src, addr);
- BUG_ON(!src_pte || pte_none(*src_pte)); /* prefaulted */
entry = *src_pte;
ptepage = pte_page(entry);
get_page(ptepage);
add_mm_counter(dst, rss, HPAGE_SIZE / PAGE_SIZE);
set_huge_pte_at(dst, addr, dst_pte, entry);
- addr += HPAGE_SIZE;
}
return 0;
@@ -338,61 +339,6 @@ void zap_hugepage_range(struct vm_area_s
spin_unlock(&mm->page_table_lock);
}
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
- struct mm_struct *mm = current->mm;
- unsigned long addr;
- int ret = 0;
-
- WARN_ON(!is_vm_hugetlb_page(vma));
- BUG_ON(vma->vm_start & ~HPAGE_MASK);
- BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
- hugetlb_prefault_arch_hook(mm);
-
- spin_lock(&mm->page_table_lock);
- for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
- unsigned long idx;
- pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;
-
- if (!pte) {
- ret = -ENOMEM;
- goto out;
- }
-
- idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
- page = find_get_page(mapping, idx);
- if (!page) {
- /* charge the fs quota first */
- if (hugetlb_get_quota(mapping)) {
- ret = -ENOMEM;
- goto out;
- }
- page = alloc_huge_page();
- if (!page) {
- hugetlb_put_quota(mapping);
- ret = -ENOMEM;
- goto out;
- }
- ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
- if (! ret) {
- unlock_page(page);
- } else {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
- }
- add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
- set_huge_pte_at(mm, addr, pte, make_huge_pte(vma, page));
- }
-out:
- spin_unlock(&mm->page_table_lock);
- return ret;
-}
-
int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, int *length, int i)
@@ -440,3 +386,91 @@ int follow_hugetlb_page(struct mm_struct
return i;
}
+
+static struct page *find_get_huge_page(struct address_space *mapping,
+ unsigned long idx)
+{
+ struct page *page = NULL;
+
+retry:
+ page = find_get_page(mapping, idx);
+ if (page)
+ goto out;
+
+ if (hugetlb_get_quota(mapping))
+ goto out;
+ page = alloc_huge_page();
+ if (!page) {
+ hugetlb_put_quota(mapping);
+ goto out;
+ }
+
+ if (add_to_page_cache(page, mapping, idx, GFP_ATOMIC)) {
+ put_page(page);
+ hugetlb_put_quota(mapping);
+ goto retry;
+ }
+ unlock_page(page);
+out:
+ return page;
+}
+
+static int hugetlb_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int write_access)
+{
+ int ret = VM_FAULT_MINOR;
+ unsigned long idx;
+ pte_t *pte;
+ struct page *page;
+ struct address_space *mapping;
+
+ BUG_ON(vma->vm_start & ~HPAGE_MASK);
+ BUG_ON(vma->vm_end & ~HPAGE_MASK);
+ BUG_ON(!vma->vm_file);
+
+ pte = huge_pte_offset(mm, address);
+ if (!pte) {
+ ret = VM_FAULT_SIGBUS;
+ goto out;
+ }
+ if (!pte_none(*pte))
+ goto out;
+
+ mapping = vma->vm_file->f_mapping;
+ idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+ + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+
+ page = find_get_huge_page(mapping, idx);
+ if (!page) {
+ ret = VM_FAULT_SIGBUS;
+ goto out;
+ }
+
+ add_mm_counter(mm, rss, HPAGE_SIZE / PAGE_SIZE);
+ set_huge_pte_at(mm, address, pte, make_huge_pte(vma, page));
+out:
+ return ret;
+}
+
+int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, int write_access)
+{
+ pte_t *ptep;
+ int rc = VM_FAULT_MINOR;
+
+ spin_lock(&mm->page_table_lock);
+
+ ptep = huge_pte_alloc(mm, address);
+ if (!ptep) {
+ rc = VM_FAULT_SIGBUS;
+ goto out;
+ }
+ if (pte_none(*ptep))
+ rc = hugetlb_pte_fault(mm, vma, address, write_access);
+out:
+ if (rc == VM_FAULT_MINOR)
+ flush_tlb_page(vma, address);
+
+ spin_unlock(&mm->page_table_lock);
+ return rc;
+}
diff -upN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c
+++ current/mm/memory.c
@@ -2041,7 +2041,7 @@ int __handle_mm_fault(struct mm_struct *
inc_page_state(pgfault);
if (is_vm_hugetlb_page(vma))
- return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ return hugetlb_fault(mm, vma, address, write_access);
/*
* We need the page table lock to synchronize with kswapd
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
Initial Post (Thu, 18 Aug 2005)
Basic overcommit checking for hugetlbfs_file_mmap(), based on an implementation
used with demand faulting in SLES9.
Since demand faulting can't guarantee the availability of pages at mmap time,
this patch implements a basic sanity check to ensure that the huge pages
required to satisfy the mmap are currently available. Despite the
obvious race, I think it is a good start on doing proper accounting. I'd like
to work towards an accounting system that mimics the semantics of normal pages
(especially for the MAP_PRIVATE/COW case). That work is underway and builds on
what this patch starts.
Huge page shared memory segments are simpler and still maintain their commit on
shmget semantics.
Diffed against 2.6.14-rc2-git6
Signed-off-by: Adam Litke <[email protected]>
---
inode.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 47 insertions(+)
diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c
+++ current/fs/hugetlbfs/inode.c
@@ -45,9 +45,51 @@ static struct backing_dev_info hugetlbfs
int sysctl_hugetlb_shm_group;
+static void huge_pagevec_release(struct pagevec *pvec);
+
+unsigned long
+huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma)
+{
+ int i;
+ struct pagevec pvec;
+ unsigned long start = vma->vm_start;
+ unsigned long end = vma->vm_end;
+ unsigned long hugepages = (end - start) >> HPAGE_SHIFT;
+ pgoff_t next = vma->vm_pgoff;
+ pgoff_t endpg = next + ((end - start) >> PAGE_SHIFT);
+ struct inode *inode = vma->vm_file->f_dentry->d_inode;
+
+ /*
+ * Shared memory segments are accounted for at shmget time,
+ * not at shmat (when the mapping is actually created) so
+ * check here if the memory has already been accounted for.
+ */
+ if (inode->i_blocks != 0)
+ return 0;
+
+ pagevec_init(&pvec, 0);
+ while (next < endpg) {
+ if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE))
+ break;
+ for (i = 0; i < pagevec_count(&pvec); i++) {
+ struct page *page = pvec.pages[i];
+ if (page->index > next)
+ next = page->index;
+ if (page->index >= endpg)
+ break;
+ next++;
+ hugepages--;
+ }
+ huge_pagevec_release(&pvec);
+ }
+ return hugepages << HPAGE_SHIFT;
+}
+
static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_dentry->d_inode;
+ struct address_space *mapping = inode->i_mapping;
+ unsigned long bytes;
loff_t len, vma_len;
int ret;
@@ -66,6 +108,10 @@ static int hugetlbfs_file_mmap(struct fi
if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
return -EINVAL;
+ bytes = huge_pages_needed(mapping, vma);
+ if (!is_hugepage_mem_enough(bytes))
+ return -ENOMEM;
+
vma_len = (loff_t)(vma->vm_end - vma->vm_start);
down(&inode->i_sem);
@@ -794,6 +840,7 @@ struct file *hugetlb_zero_setup(size_t s
d_instantiate(dentry, inode);
inode->i_size = size;
inode->i_nlink = 0;
+ inode->i_blocks = 1;
file->f_vfsmnt = mntget(hugetlbfs_vfsmount);
file->f_dentry = dentry;
file->f_mapping = inode->i_mapping;
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
Adam Litke <[email protected]> wrote:
>
> +static struct page *find_get_huge_page(struct address_space *mapping,
> + unsigned long idx)
> +{
> + struct page *page = NULL;
> +
> +retry:
> + page = find_get_page(mapping, idx);
> + if (page)
> + goto out;
> +
> + if (hugetlb_get_quota(mapping))
> + goto out;
> + page = alloc_huge_page();
> + if (!page) {
> + hugetlb_put_quota(mapping);
> + goto out;
> + }
> +
> + if (add_to_page_cache(page, mapping, idx, GFP_ATOMIC)) {
> + put_page(page);
> + hugetlb_put_quota(mapping);
> + goto retry;
If add_to_page_cache() fails due to failure in radix_tree_preload(), this
code will lock up.
A lame fix is to check for -ENOMEM and bale. A better fix would be to use
GFP_KERNEL.
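Concretely, the bail-out variant might look something like this (a sketch of
the suggestion above, not the final patch; note that switching to GFP_KERNEL
would also require dropping mm->page_table_lock, which the caller currently
holds across this function):

	static struct page *find_get_huge_page(struct address_space *mapping,
					       unsigned long idx)
	{
		struct page *page;
		int err;

	retry:
		page = find_get_page(mapping, idx);
		if (page)
			return page;

		if (hugetlb_get_quota(mapping))
			return NULL;

		page = alloc_huge_page();
		if (!page) {
			hugetlb_put_quota(mapping);
			return NULL;
		}

		err = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
		if (err) {
			put_page(page);
			hugetlb_put_quota(mapping);
			if (err == -EEXIST)	/* lost the race: someone else added it */
				goto retry;
			return NULL;		/* -ENOMEM from radix_tree_preload(): bail */
		}
		unlock_page(page);
		return page;
	}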
Adam Litke <[email protected]> wrote:
>
> +int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, int write_access)
> +{
> + pte_t *ptep;
> + int rc = VM_FAULT_MINOR;
> +
> + spin_lock(&mm->page_table_lock);
> +
> + ptep = huge_pte_alloc(mm, address);
> + if (!ptep) {
> + rc = VM_FAULT_SIGBUS;
> + goto out;
> + }
> + if (pte_none(*ptep))
> + rc = hugetlb_pte_fault(mm, vma, address, write_access);
> +out:
> + if (rc == VM_FAULT_MINOR)
> + flush_tlb_page(vma, address);
> +
> + spin_unlock(&mm->page_table_lock);
> + return rc;
> +}
label `out' can be moved down a couple of lines.
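That is, something like the following (illustrating the nit only, otherwise
unchanged), so the VM_FAULT_MINOR test isn't evaluated on the SIGBUS path:

	int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
			  unsigned long address, int write_access)
	{
		pte_t *ptep;
		int rc = VM_FAULT_MINOR;

		spin_lock(&mm->page_table_lock);

		ptep = huge_pte_alloc(mm, address);
		if (!ptep) {
			rc = VM_FAULT_SIGBUS;
			goto out;
		}
		if (pte_none(*ptep))
			rc = hugetlb_pte_fault(mm, vma, address, write_access);

		if (rc == VM_FAULT_MINOR)
			flush_tlb_page(vma, address);
	out:
		spin_unlock(&mm->page_table_lock);
		return rc;
	}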
Adam Litke <[email protected]> wrote:
>
> Initial Post (Thu, 18 Aug 2005)
>
> Basic overcommit checking for hugetlbfs_file_mmap(), based on an implementation
> used with demand faulting in SLES9.
>
> Since demand faulting can't guarantee the availability of pages at mmap time,
> this patch implements a basic sanity check to ensure that the huge pages
> required to satisfy the mmap are currently available. Despite the
> obvious race, I think it is a good start on doing proper accounting. I'd like
> to work towards an accounting system that mimics the semantics of normal pages
> (especially for the MAP_PRIVATE/COW case). That work is underway and builds on
> what this patch starts.
>
> Huge page shared memory segments are simpler and still maintain their commit on
> shmget semantics.
>
> Diffed against 2.6.14-rc2-git6
>
> Signed-off-by: Adam Litke <[email protected]>
> ---
> inode.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 47 insertions(+)
> diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
> --- reference/fs/hugetlbfs/inode.c
> +++ current/fs/hugetlbfs/inode.c
> @@ -45,9 +45,51 @@ static struct backing_dev_info hugetlbfs
>
> int sysctl_hugetlb_shm_group;
>
> +static void huge_pagevec_release(struct pagevec *pvec);
nit: personally I prefer to move the helper function to the top of the file
rather than having to forward-declare it.
> +unsigned long
> +huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma)
> +{
What does this function do? Seems to count all the present pages within a
vma which are backed by a particular hugetlbfs file? Or something?
If so, the chosen name seems strange. And it definitely needs a decent
comment.
> + int i;
> + struct pagevec pvec;
> + unsigned long start = vma->vm_start;
> + unsigned long end = vma->vm_end;
> + unsigned long hugepages = (end - start) >> HPAGE_SHIFT;
`hugepages' is the size of the vma
> + pgoff_t next = vma->vm_pgoff;
> + pgoff_t endpg = next + ((end - start) >> PAGE_SHIFT);
> + struct inode *inode = vma->vm_file->f_dentry->d_inode;
> +
> + /*
> + * Shared memory segments are accounted for at shmget time,
> + * not at shmat (when the mapping is actually created) so
> + * check here if the memory has already been accounted for.
> + */
> + if (inode->i_blocks != 0)
> + return 0;
> +
> + pagevec_init(&pvec, 0);
> + while (next < endpg) {
> + if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE))
> + break;
> + for (i = 0; i < pagevec_count(&pvec); i++) {
> + struct page *page = pvec.pages[i];
> + if (page->index > next)
> + next = page->index;
> + if (page->index >= endpg)
> + break;
> + next++;
> + hugepages--;
And we subtract one from it for each present page.
> + }
> + huge_pagevec_release(&pvec);
> + }
> + return hugepages << HPAGE_SHIFT;
> +}
So it seems to be returning the number of bytes which are still unpopulated
within this vma?
Think you can rework this code to reduce my perplexity?
Andrew Morton <[email protected]> writes:
(having written the original SLES9 code I will chime in ...)
> > +unsigned long
> > +huge_pages_needed(struct address_space *mapping, struct vm_area_struct *vma)
> > +{
>
> What does this function do? Seems to count all the present pages within a
> vma which are backed by a particular hugetlbfs file? Or something?
It counts how many huge pages are still needed to fill up a mapping completely.
In short it counts the holes. I think the name fits.
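For the comment Andrew asked for, something along these lines would do
(suggested wording, not text from the patch):

	/*
	 * huge_pages_needed - count the huge pages a mapping is still missing
	 *
	 * Walk the page cache for the portion of @mapping covered by @vma and
	 * count the huge pages that are not yet present (the "holes").  Shared
	 * memory segments are already accounted for at shmget() time, so they
	 * report zero here.  Returns the number of bytes of huge pages still
	 * required to populate the mapping completely, for the overcommit
	 * check in hugetlbfs_file_mmap().
	 */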
-Andi
Hi Andi
I have a dual Opteron machine, with 8GB of ram on each node.
With the latest kernels I see high CPU profiles in mm/slab.c.
kfree() is NUMA-aware, so far so good, but the price seems heavy.
I noticed this in the 2.6.14-rc2 syslog:
Node 0 MemBase 0000000000000000 Limit 00000001ffffffff
Node 1 MemBase 0000000200000000 Limit 00000003ffffffff
Using 23 for the hash shift. Max adder is 3ffffffff
instead of previous (2.6.13) :
Node 0 MemBase 0000000000000000 Limit 00000001ffffffff
Node 1 MemBase 0000000200000000 Limit 00000003ffffffff
Using 27 for the hash shift. Max adder is 3ffffffff
After some code review, I see NODEMAPSIZE was raised from 0xff to 0xfff.
phys_to_nid() is now reading one byte out of 2048 bytes with
(memnode_shift=23, units of 8MB).
But shouldn't we try to use the highest possible value for memnode_shift?
Using memnode_shift=33 would access only 2 bytes from this memnodemap[],
touching fewer cache lines (well, one cache line). kfree() and friends would
be slightly faster, or at least more cache friendly.
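For reference, phys_to_nid() on x86-64 is basically a one-byte table lookup,
roughly like this (a simplified sketch, not a quote of the header):

	/* one byte of memnodemap[] covers (1 << memnode_shift) bytes of RAM,
	 * so a larger shift means a smaller map and fewer cache lines touched
	 * by every lookup */
	#define phys_to_nid(addr)	(memnodemap[(addr) >> memnode_shift])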
Another question is :
Could we add in pda (struct x8664_pda) the node of the cpu ?
We currently do :
#define numa_node_id() (cpu_to_node(raw_smp_processor_id()))
Instead of reading the processor_id from pda, then access cpu_to_node[], we
could directly get this information from pda.
#if defined(CONFIG_NUMA)
static inline __attribute_pure__ int numa_node_id() { return read_pda(node);}
#else
#define numa_node_id() 0
#endif
Thank you
Eric
On Wed, 28 Sep 2005, Adam Litke wrote:
> Hi Andrew. Can we give hugetlb demand faulting a spin in the mm tree?
> And could people with alpha, sparc, and ia64 machines give them a good
> spin? I haven't been able to test those arches yet.
It's going to be a little confusing if these go in while I'm moving
the page_table_lock inwards. My patches don't make a big difference
to hugetlb (I've not attempted splitting the lock at all for hugetlb -
there would be per-arch implementation issues and very little point -
though more point if we do move to hugetlb faulting). But I'm ill at
ease with changing the locking at one end while it's unclear whether
it's right at the other end.
Currently Adam's patches don't include my hugetlb changes already in
-mm; and I don't see any attention in his patches to the issue of
hugetlb file truncation, which I was fixing up in those.
The current hugetlb_prefault guards against this with i_sem held:
which _appears_ to be a lock ordering violation, but may not be,
since the official lock ordering is determined by the possibility
of fault within write, whereas hugetlb mmaps were never faulting.
Presumably on-demand hugetlb faulting would entail truncate_count
checking like do_no_page, and corresponding code in hugetlbfs.
I've no experience of hugetlb use. Personally, I'd be very happy with
a decision to disallow truncation of hugetlb files (seems odd to allow
ftruncate when read and write are not allowed, and the size normally
determined automatically by mmap size); but I have to assume that it's
been allowed for good reason.
> - htlb-get_user_pages removes an optimization that is no longer valid
> when demand faulting huge pages
>
> - htlb-fault moves the fault logic from hugetlb_prefault() to
> hugetlb_pte_fault() and find_get_huge_page().
>
> - htlb-acct adds an overcommit check to maintain the no-overcommit
> semantics provided by hugetlb_prefault()
Yes, I found that last one rather strange too. Doing it at creation
time based on i_size (and updating if i_size is allowed to change) is
one possibility; doing it at fault time whenever newly allocated is
another possibility; but these pagevec lookups at mmap time seem odd.
Hugh
> Using memnode_shift=33 would access only 2 bytes from this memnodemap[],
> touching fewer cache lines (well , one cache line). kfree() and friends
> would be slightly faster, at least cache friendly.
Agreed. Please send a patch.
>
> Another question is :
>
> Could we add in pda (struct x8664_pda) the node of the cpu ?
>
> We currently do :
>
> #define numa_node_id() (cpu_to_node(raw_smp_processor_id()))
>
> Instead of reading the processor_id from pda, then access cpu_to_node[], we
> could directly get this information from pda.
>
> #if defined(CONFIG_NUMA)
> static inline __attribute_pure__ int numa_node_id() { return
> read_pda(node);}
> #else
> #define numa_node_id() 0
> #endif
Should be fine too. Please send a patch for that too.
-Andi
--- linux-2.6.14-rc2/arch/x86_64/mm/numa.c 2005-09-20 05:00:41.000000000 +0200
+++ linux-2.6.14-rc2-ed/arch/x86_64/mm/numa.c 2005-09-29 18:45:42.000000000 +0200
@@ -38,38 +38,62 @@
int numa_off __initdata;
-int __init compute_hash_shift(struct node *nodes, int numnodes)
+
+/*
+ * Given a shift value, try to populate memnodemap[]
+ * Returns :
+ * 1 if OK
+ * 0 if memnodemap[] too small (or shift too small)
+ * -1 if node overlap or lost ram (shift too big)
+ */
+static int __init populate_memnodemap(
+ const struct node *nodes, int numnodes, int shift)
{
int i;
- int shift = 20;
- unsigned long addr,maxend=0;
-
- for (i = 0; i < numnodes; i++)
- if ((nodes[i].start != nodes[i].end) && (nodes[i].end > maxend))
- maxend = nodes[i].end;
-
- while ((1UL << shift) < (maxend / NODEMAPSIZE))
- shift++;
+ int res = -1;
+ unsigned long addr, end, lost;
- printk (KERN_DEBUG"Using %d for the hash shift. Max adder is %lx \n",
- shift,maxend);
- memset(memnodemap,0xff,sizeof(*memnodemap) * NODEMAPSIZE);
+ memset(memnodemap, 0xff, sizeof(memnodemap));
for (i = 0; i < numnodes; i++) {
- if (nodes[i].start == nodes[i].end)
+ addr = (nodes[i].start >> shift);
+ end = ((nodes[i].end + 1) >> shift);
+
+ /* check we don't lose more than 16 MB of ram */
+ lost = (nodes[i].end + 1) - (end << shift);
+ if (lost >= (1<<24))
+ return -1;
+
+ if (addr >= end)
continue;
- for (addr = nodes[i].start;
- addr < nodes[i].end;
- addr += (1UL << shift)) {
- if (memnodemap[addr >> shift] != 0xff) {
- printk(KERN_INFO
- "Your memory is not aligned you need to rebuild your kernel "
- "with a bigger NODEMAPSIZE shift=%d adder=%lu\n",
- shift,addr);
+ if (end >= NODEMAPSIZE)
+ return 0;
+ res = 1;
+ for (; addr < end; addr++) {
+ if (memnodemap[addr] != 0xff)
return -1;
- }
- memnodemap[addr >> shift] = i;
- }
+ memnodemap[addr] = i;
+ }
}
+ return res;
+}
+
+int __init compute_hash_shift(struct node *nodes, int numnodes)
+{
+ int shift = 20;
+
+ while (populate_memnodemap(nodes, numnodes, shift + 1) >= 0)
+ shift++;
+
+ printk(KERN_DEBUG "Using %d for the hash shift.\n",
+ shift);
+
+ if (populate_memnodemap(nodes, numnodes, shift) != 1) {
+ printk(KERN_INFO
+ "Your memory is not aligned you need to rebuild your kernel "
+ "with a bigger NODEMAPSIZE shift=%d\n",
+ shift);
+ return -1;
+ }
return shift;
}
--- linux-2.6.14-rc2/arch/x86_64/mm/numa.c 2005-09-20 05:00:41.000000000 +0200
+++ linux-2.6.14-rc2-ed/arch/x86_64/mm/numa.c 2005-09-30 11:04:18.000000000 +0200
@@ -38,38 +38,57 @@
int numa_off __initdata;
-int __init compute_hash_shift(struct node *nodes, int numnodes)
+
+/*
+ * Given a shift value, try to populate memnodemap[]
+ * Returns :
+ * 1 if OK
+ * 0 if memnodemap[] too small (or shift too small)
+ * -1 if node overlap or lost ram (shift too big)
+ */
+static int __init populate_memnodemap(
+ const struct node *nodes, int numnodes, int shift)
{
int i;
- int shift = 20;
- unsigned long addr,maxend=0;
-
- for (i = 0; i < numnodes; i++)
- if ((nodes[i].start != nodes[i].end) && (nodes[i].end > maxend))
- maxend = nodes[i].end;
+ int res = -1;
+ unsigned long addr, end;
- while ((1UL << shift) < (maxend / NODEMAPSIZE))
- shift++;
-
- printk (KERN_DEBUG"Using %d for the hash shift. Max adder is %lx \n",
- shift,maxend);
- memset(memnodemap,0xff,sizeof(*memnodemap) * NODEMAPSIZE);
+ memset(memnodemap, 0xff, sizeof(memnodemap));
for (i = 0; i < numnodes; i++) {
- if (nodes[i].start == nodes[i].end)
+ addr = nodes[i].start;
+ end = nodes[i].end;
+ if (addr >= end)
continue;
- for (addr = nodes[i].start;
- addr < nodes[i].end;
- addr += (1UL << shift)) {
- if (memnodemap[addr >> shift] != 0xff) {
- printk(KERN_INFO
- "Your memory is not aligned you need to rebuild your kernel "
- "with a bigger NODEMAPSIZE shift=%d adder=%lu\n",
- shift,addr);
+ if ((end >> shift) >= NODEMAPSIZE)
+ return 0;
+ do {
+ if (memnodemap[addr >> shift] != 0xff)
return -1;
- }
memnodemap[addr >> shift] = i;
- }
+ addr += (1 << shift);
+ } while (addr < end);
+ res = 1;
}
+ return res;
+}
+
+int __init compute_hash_shift(struct node *nodes, int numnodes)
+{
+ int shift = 20;
+
+ while (populate_memnodemap(nodes, numnodes, shift + 1) >= 0)
+ shift++;
+
+ printk(KERN_DEBUG "Using %d for the hash shift.\n",
+ shift);
+
+ if (populate_memnodemap(nodes, numnodes, shift) != 1) {
+ printk(KERN_INFO
+ "Your memory is not aligned you need to rebuild your kernel "
+ "with a bigger NODEMAPSIZE shift=%d\n",
+ shift);
+ return -1;
+ }
return shift;
}
On Friday 30 September 2005 11:09, Eric Dumazet wrote:
> + while (populate_memnodemap(nodes, numnodes, shift + 1) >= 0)
> + shift++;
Why shift+1 here?
>+ if ((end >> shift) >= NODEMAPSIZE)
>+ return 0;
This should be >, not >=, shouldn't it?
-Andi
P.S.: Please cc x86-64 patches to [email protected]
Andi Kleen wrote:
> On Friday 30 September 2005 11:09, Eric Dumazet wrote:
>
>>+ while (populate_memnodemap(nodes, numnodes, shift + 1) >= 0)
>>+ shift++;
>
>
>
> Why shift+1 here?
Thank you Andi for reviewing this stuff.
The idea is to find the highest shift value, and to break the loop as soon as
the (shift + 1) value gives us a "shift too big" error.
Maybe you want to write:
while (populate_memnodemap(nodes, numnodes, ++shift) >= 0) ;
shift--;
Well, that's only style...
>
>
>>+ if ((end >> shift) >= NODEMAPSIZE)
>>+ return 0;
>
>
> This should be >, not >= shouldn't it?
Let's take an example
end = 0xffffffff;
start = 0xfff00000;
shift = 20
Suppose that NODEMAPSIZE == (end >> shift) == 0xfff
If the test is changed to :
if ((end >> shift) > NODEMAPSIZE)
return 0;
We could do one of the iterations with (addr < end) but (addr >> shift) ==
NODEMAPSIZE:
if (memnodemap[NODEMAPSIZE] != 0xff)
return -1;
memnodemap[NODEMAPSIZE] = i;
That's a bounds violation of memnodemap[].
I also wonder why NODEMAPSIZE is 0xfff and not 0x1000, because this off-by-one
makes half of memnodemap[] unused for power-of-two ram sizes.
>
> -Andi
>
> P.S.: Please cc x86-64 patches to [email protected]
Ah thank you
Eric
On Thu, 2005-09-29 at 14:32 +0100, Hugh Dickins wrote:
> On Wed, 28 Sep 2005, Adam Litke wrote:
>
> > Hi Andrew. Can we give hugetlb demand faulting a spin in the mm tree?
> > And could people with alpha, sparc, and ia64 machines give them a good
> > spin? I haven't been able to test those arches yet.
>
> It's going to be a little confusing if these go in while I'm moving
> the page_table_lock inwards. My patches don't make a big difference
> to hugetlb (I've not attempted splitting the lock at all for hugetlb -
> there would be per-arch implementation issues and very little point -
> though more point if we do move to hugetlb faulting). But I'm ill at
> ease with changing the locking at one end while it's unclear whether
> it's right at the other end.
>
> Currently Adam's patches don't include my hugetlb changes already in
> -mm; and I don't see any attention in his patches to the issue of
> hugetlb file truncation, which I was fixing up in those.
>
> The current hugetlb_prefault guards against this with i_sem held:
> which _appears_ to be a lock ordering violation, but may not be,
> since the official lock ordering is determined by the possibility
> of fault within write, whereas hugetlb mmaps were never faulting.
>
> Presumably on-demand hugetlb faulting would entail truncate_count
> checking like do_no_page, and corresponding code in hugetlbfs.
>
> I've no experience of hugetlb use. Personally, I'd be very happy with
> a decision to disallow truncation of hugetlb files (seems odd to allow
> ftruncate when read and write are not allowed, and the size normally
> determined automatically by mmap size); but I have to assume that it's
> been allowed for good reason.
If I were to spend time coding up a patch to remove truncation support
for hugetlbfs, would it be something other people would want to see
merged as well?
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center