2004-03-25 16:51:36

by Andy Whitcroft

Subject: [PATCH] [0/6] HUGETLB memory commitment

Here is the latest incarnation of my hugetlb patches, rediffed
against 2.6.5-rc2-bk4, with the addition of
080-mem_acctdom_hugetlb_sysctl, which generalises the sysctl
support and uses it for hugetlb.

The overall problem is described below. Feedback and testing
appreciated.

Cheers.

-apw


HUGETLB Overcommit Handling
---------------------------
When building mappings the kernel tracks committed but not yet
allocated pages against available memory and swap, preventing
memory allocation problems later. The introduction of hugetlb
pages has significant ramifications for this accounting, as the
pages used to back them are already removed from the available
memory pool. Currently, mappings involving these pages are still
accounted against the small page pool, leading either to
overcommitment of the normal page pool or to incorrectly failed
hugetlb allocations where hugetlb memory exceeds the remaining
normal pool. Also, as there is no commitment tracking on hugetlb
pages, it is not possible to safely fault them in on demand;
this is a problem for large segments, where the prefault and
clear times are excessive.
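
For reference, the strict-commitment check at the heart of
cap_vm_enough_memory() and friends boils down to the following
(a simplified sketch of the existing code, visible unchanged in
the security/ hunks later in this series):

	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
	allowed += total_swap_pages;
	if (atomic_read(&vm_committed_space) < allowed)
		return 0;	/* commitment accepted */
	return -ENOMEM;		/* would overcommit */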

This patch set attempts to address some of these issues and to
provide a platform for fixing the remainder. Firstly, it removes
hugetlb allocations from the normal page pool. Secondly, it
introduces a general mechanism for accounting against multiple
page pools. Thirdly, it implements and enforces hugetlb
commitments via these pools.

050-mem_acctdom_core: core changes to create two accounting domains
055-mem_acctdom_arch: architecture specific changes for above
060-mem_acctdom_commitments: splits vm_committed into a per domain count
070-mem_acctdom_hugetlb: use vm_committed to track HUGETLB usage
075-mem_acctdom_hugetlb_arch: architecture specific changes for above
080-mem_acctdom_hugetlb_sysctl: generalise sysctl parameters and add hugetlb

The first two patches introduce the concept of a split between
the default and hugetlb memory pools and stop the hugetlb pool
from being accounted at all. This is not as clean as I would
like, particularly the need to check against VM_AD_DEFAULT in a
few places.

The third patch splits the vm_committed count into a per-domain
count and exposes the domain in the interface.

The fourth and fifth patches convert hugetlb to use the
commitment interfaces exposed above.

The sixth patch generalises the overcommit mode and ratios to all
domains and adds support for controlling the hugetlb domain with it.
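
For example, with the full set applied the hugetlb domain can be
switched to always-allow mode much as the normal pool can today
(illustrative only; the sysctl names are those added by the
sixth patch):

	sysctl -w vm.overcommit_memory_hugepages=1
	sysctl -w vm.overcommit_ratio_hugepages=100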

Below is a transcript of a test showing the commitments being
applied. The test attempts to create three 400x2MB-page shared
memory segments with 850 huge pages available. The main thing to
note is the commitment taken against the pages at shmget() time;
this is a prerequisite for reliable accounting under fault-driven
page instantiation. (A sketch of the tester follows.)
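
The tester itself was not posted; the following is a minimal
sketch of what it presumably does, with the keys and segment
size taken from the transcript (SHM_HUGETLB assumed available,
meminfo snapshots between steps omitted):

	#include <stdio.h>
	#include <errno.h>
	#include <sys/ipc.h>
	#include <sys/shm.h>

	#ifndef SHM_HUGETLB
	#define SHM_HUGETLB 04000	/* from <linux/shm.h> */
	#endif

	#define SEG_BYTES (400UL * 2048 * 1024)	/* 400 x 2MB pages */

	int main(void)
	{
		int i;

		for (i = 0; i < 3; i++) {
			void *p;
			int id = shmget(0xdead0000 + i, SEG_BYTES,
					IPC_CREAT | SHM_HUGETLB | 0600);
			if (id < 0) {
				printf("test: shmget failed - errno=%d\n",
				       errno);
				return 1;
			}
			/* commitment was already taken at shmget time */
			p = shmat(id, NULL, 0);
			if (p == (void *)-1) {
				printf("test: shmat failed - errno=%d\n",
				       errno);
				return 1;
			}
			printf("test: shmat smp=%p\n", p);
		}
		return 0;
	}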

[root@kite apw]# ./tester
kernel.shmmax = 2147483648
kernel.shmall = 2147483648
vm.nr_hugepages = 850
=== FIRST ===
=== before shmget ===
HugePages_Total: 850
HugePages_Free: 850
Hugepagesize: 2048 kB
HugeCommited_AS: 0 kB
=== before shmat ===
HugePages_Total: 850
HugePages_Free: 850
Hugepagesize: 2048 kB
HugeCommited_AS: 819200 kB
test: shmat smp=42200000
=== after shmat ===
HugePages_Total: 850
HugePages_Free: 450
Hugepagesize: 2048 kB
HugeCommited_AS: 819200 kB
=== SECOND ===
=== before shmget ===
HugePages_Total: 850
HugePages_Free: 450
Hugepagesize: 2048 kB
HugeCommited_AS: 819200 kB
=== before shmat ===
HugePages_Total: 850
HugePages_Free: 450
Hugepagesize: 2048 kB
HugeCommited_AS: 1638400 kB
test: shmat smp=42200000
=== after shmat ===
HugePages_Total: 850
HugePages_Free: 50
Hugepagesize: 2048 kB
HugeCommited_AS: 1638400 kB
=== THIRD ===
=== before shmget ===
HugePages_Total: 850
HugePages_Free: 50
Hugepagesize: 2048 kB
HugeCommited_AS: 1638400 kB
test: shmget failed - errno=12
=== before ipcrm -M 0xdead0000 ===
HugePages_Total: 850
HugePages_Free: 50
Hugepagesize: 2048 kB
HugeCommited_AS: 1638400 kB
=== before ipcrm -M 0xdead0001 ===
HugePages_Total: 850
HugePages_Free: 450
Hugepagesize: 2048 kB
HugeCommited_AS: 819200 kB
=== before ipcrm -M 0xdead0002 ===
HugePages_Total: 850
HugePages_Free: 850
Hugepagesize: 2048 kB
HugeCommited_AS: 0 kB
ipcrm: invalid key (0xdead0002)
=== after ===
HugePages_Total: 850
HugePages_Free: 850
Hugepagesize: 2048 kB
HugeCommited_AS: 0 kB
vm.nr_hugepages = 0


2004-03-25 16:56:24

by Andy Whitcroft

Subject: [PATCH] [2/6] HUGETLB memory commitment

[055-mem_acctdom_arch]

Memory accounting domains (arch)

---
ia64/ia32/binfmt_elf32.c | 3 ++-
mips/kernel/sysirix.c | 3 ++-
s390/kernel/compat_exec.c | 3 ++-
x86_64/ia32/ia32_binfmt.c | 3 ++-
4 files changed, 8 insertions(+), 4 deletions(-)

diff -upN reference/arch/ia64/ia32/binfmt_elf32.c current/arch/ia64/ia32/binfmt_elf32.c
--- reference/arch/ia64/ia32/binfmt_elf32.c 2004-03-11 20:47:12.000000000 +0000
+++ current/arch/ia64/ia32/binfmt_elf32.c 2004-03-25 15:03:32.000000000 +0000
@@ -168,7 +168,8 @@ ia32_setup_arg_pages (struct linux_binpr
if (!mpnt)
return -ENOMEM;

- if (security_vm_enough_memory((IA32_STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
+ if (security_vm_enough_memory(VM_AD_DEFAULT, (IA32_STACK_TOP -
+ (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
kmem_cache_free(vm_area_cachep, mpnt);
return -ENOMEM;
}
diff -upN reference/arch/mips/kernel/sysirix.c current/arch/mips/kernel/sysirix.c
--- reference/arch/mips/kernel/sysirix.c 2004-03-11 20:47:13.000000000 +0000
+++ current/arch/mips/kernel/sysirix.c 2004-03-25 15:03:32.000000000 +0000
@@ -578,7 +578,8 @@ asmlinkage int irix_brk(unsigned long br
/*
* Check if we have enough memory..
*/
- if (security_vm_enough_memory((newbrk-oldbrk) >> PAGE_SHIFT)) {
+ if (security_vm_enough_memory(VM_AD_DEFAULT,
+ (newbrk-oldbrk) >> PAGE_SHIFT)) {
ret = -ENOMEM;
goto out;
}
diff -upN reference/arch/s390/kernel/compat_exec.c current/arch/s390/kernel/compat_exec.c
--- reference/arch/s390/kernel/compat_exec.c 2004-01-09 06:59:57.000000000 +0000
+++ current/arch/s390/kernel/compat_exec.c 2004-03-25 15:03:32.000000000 +0000
@@ -56,7 +56,8 @@ int setup_arg_pages32(struct linux_binpr
if (!mpnt)
return -ENOMEM;

- if (security_vm_enough_memory((STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
+ if (security_vm_enough_memory(VM_AD_DEFAULT, (STACK_TOP -
+ (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
kmem_cache_free(vm_area_cachep, mpnt);
return -ENOMEM;
}
diff -upN reference/arch/x86_64/ia32/ia32_binfmt.c current/arch/x86_64/ia32/ia32_binfmt.c
--- reference/arch/x86_64/ia32/ia32_binfmt.c 2004-03-25 02:42:14.000000000 +0000
+++ current/arch/x86_64/ia32/ia32_binfmt.c 2004-03-25 15:03:32.000000000 +0000
@@ -344,7 +344,8 @@ int setup_arg_pages(struct linux_binprm
if (!mpnt)
return -ENOMEM;

- if (security_vm_enough_memory((IA32_STACK_TOP - (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
+ if (security_vm_enough_memory(VM_AD_DEFAULT, (IA32_STACK_TOP -
+ (PAGE_MASK & (unsigned long) bprm->p))>>PAGE_SHIFT)) {
kmem_cache_free(vm_area_cachep, mpnt);
return -ENOMEM;
}

2004-03-25 16:55:34

by Andy Whitcroft

Subject: [PATCH] [1/6] HUGETLB memory commitment

[050-mem_acctdom_core]

Memory accounting domains (core)

When hugetlb memory is in use we effectively split memory into
two independent and non-overlapping 'page' pools, from which we
can allocate pages and against which we wish to handle
commitments. Currently all allocations are accounted against the
normal page pool, which can lead to false allocation failures.

This patch provides the framework to allow these pools to be
treated separately, preventing allocations from the hugetlb pool
from being accounted against the small page pool. The hugetlb
page pool is not accounted at all and is effectively treated as
being in overcommit mode.

The patch creates the concept of an accounting domain, against
which pages are to be accounted. In this implementation there
are two domains: VM_AD_DEFAULT, which is used to account normal
small pages in the normal way, and VM_AD_HUGETLB, which is used
to select and identify VM_HUGETLB pages. I have not attempted to
add any actual accounting for VM_HUGETLB pages, as currently
they are prefaulted and thus there is always zero outstanding
commitment to track. Obviously, if hugetlb were also changed to
support demand paging, that would need to change.
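
In practice a caller selects the domain from the VMA and passes
it down to the commitment check; a minimal sketch using the
macros this patch adds:

	int dom = VM_ACCTDOM(vma); /* VM_AD_HUGETLB iff VM_HUGETLB set */

	if (security_vm_enough_memory(dom, len >> PAGE_SHIFT))
		return -ENOMEM;	/* refused; with this patch the
				   hugetlb domain always succeeds */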

---
fs/exec.c | 2 +-
include/linux/mm.h | 6 ++++++
include/linux/security.h | 15 ++++++++-------
kernel/fork.c | 8 +++++---
mm/memory.c | 1 +
mm/mmap.c | 18 +++++++++++-------
mm/mprotect.c | 5 +++--
mm/mremap.c | 4 ++--
mm/shmem.c | 10 ++++++----
mm/swapfile.c | 2 +-
security/commoncap.c | 8 +++++++-
security/dummy.c | 8 +++++++-
security/selinux/hooks.c | 8 +++++++-
13 files changed, 65 insertions(+), 30 deletions(-)

diff -upN reference/fs/exec.c current/fs/exec.c
--- reference/fs/exec.c 2004-03-11 20:47:24.000000000 +0000
+++ current/fs/exec.c 2004-03-25 15:03:32.000000000 +0000
@@ -409,7 +409,7 @@ int setup_arg_pages(struct linux_binprm
if (!mpnt)
return -ENOMEM;

- if (security_vm_enough_memory(arg_size >> PAGE_SHIFT)) {
+ if (security_vm_enough_memory(VM_AD_DEFAULT, arg_size >> PAGE_SHIFT)) {
kmem_cache_free(vm_area_cachep, mpnt);
return -ENOMEM;
}
diff -upN reference/include/linux/mm.h current/include/linux/mm.h
--- reference/include/linux/mm.h 2004-03-25 02:43:39.000000000 +0000
+++ current/include/linux/mm.h 2004-03-25 15:03:32.000000000 +0000
@@ -112,6 +112,12 @@ struct vm_area_struct {
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */

+/* Memory accounting domains. */
+#define VM_ACCTDOM_NR 2
+#define VM_ACCTDOM(vma) (!!((vma)->vm_flags & VM_HUGETLB))
+#define VM_AD_DEFAULT 0
+#define VM_AD_HUGETLB 1
+
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
#endif
diff -upN reference/include/linux/security.h current/include/linux/security.h
--- reference/include/linux/security.h 2004-03-25 02:43:39.000000000 +0000
+++ current/include/linux/security.h 2004-03-25 15:03:32.000000000 +0000
@@ -51,7 +51,7 @@ extern int cap_inode_removexattr(struct
extern int cap_task_post_setuid (uid_t old_ruid, uid_t old_euid, uid_t old_suid, int flags);
extern void cap_task_reparent_to_init (struct task_struct *p);
extern int cap_syslog (int type);
-extern int cap_vm_enough_memory (long pages);
+extern int cap_vm_enough_memory (int domain, long pages);

static inline int cap_netlink_send (struct sk_buff *skb)
{
@@ -988,7 +988,8 @@ struct swap_info_struct;
* @type contains the type of action.
* Return 0 if permission is granted.
* @vm_enough_memory:
- * Check permissions for allocating a new virtual mapping.
+ * Check permissions for allocating a new virtual mapping.
+ * @domain contains the accounting domain.
* @pages contains the number of pages.
* Return 0 if permission is granted.
*
@@ -1022,7 +1023,7 @@ struct security_operations {
int (*quotactl) (int cmds, int type, int id, struct super_block * sb);
int (*quota_on) (struct file * f);
int (*syslog) (int type);
- int (*vm_enough_memory) (long pages);
+ int (*vm_enough_memory) (int domain, long pages);

int (*bprm_alloc_security) (struct linux_binprm * bprm);
void (*bprm_free_security) (struct linux_binprm * bprm);
@@ -1277,9 +1278,9 @@ static inline int security_syslog(int ty
return security_ops->syslog(type);
}

-static inline int security_vm_enough_memory(long pages)
+static inline int security_vm_enough_memory(int domain, long pages)
{
- return security_ops->vm_enough_memory(pages);
+ return security_ops->vm_enough_memory(domain, pages);
}

static inline int security_bprm_alloc (struct linux_binprm *bprm)
@@ -1949,9 +1950,9 @@ static inline int security_syslog(int ty
return cap_syslog(type);
}

-static inline int security_vm_enough_memory(long pages)
+static inline int security_vm_enough_memory(int domain, long pages)
{
- return cap_vm_enough_memory(pages);
+ return cap_vm_enough_memory(domain, pages);
}

static inline int security_bprm_alloc (struct linux_binprm *bprm)
diff -upN reference/kernel/fork.c current/kernel/fork.c
--- reference/kernel/fork.c 2004-03-11 20:47:29.000000000 +0000
+++ current/kernel/fork.c 2004-03-25 15:03:32.000000000 +0000
@@ -301,9 +301,10 @@ static inline int dup_mmap(struct mm_str
continue;
if (mpnt->vm_flags & VM_ACCOUNT) {
unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
- if (security_vm_enough_memory(len))
+ if (security_vm_enough_memory(VM_ACCTDOM(mpnt), len))
goto fail_nomem;
- charge += len;
+ if (VM_ACCTDOM(mpnt) == VM_AD_DEFAULT)
+ charge += len;
}
tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
if (!tmp)
@@ -358,7 +359,8 @@ out:
fail_nomem:
retval = -ENOMEM;
fail:
- vm_unacct_memory(charge);
+ if (charge)
+ vm_unacct_memory(charge);
goto out;
}
static inline int mm_alloc_pgd(struct mm_struct * mm)
diff -upN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c 2004-03-25 02:43:43.000000000 +0000
+++ current/mm/memory.c 2004-03-25 15:03:32.000000000 +0000
@@ -551,6 +551,7 @@ int unmap_vmas(struct mmu_gather **tlbp,
if (end <= vma->vm_start)
continue;

+ /* We assume that only accountable VMAs are VM_ACCOUNT. */
if (vma->vm_flags & VM_ACCOUNT)
*nr_accounted += (end - start) >> PAGE_SHIFT;

diff -upN reference/mm/mmap.c current/mm/mmap.c
--- reference/mm/mmap.c 2004-03-25 02:43:43.000000000 +0000
+++ current/mm/mmap.c 2004-03-25 15:03:32.000000000 +0000
@@ -490,8 +490,11 @@ unsigned long do_mmap_pgoff(struct file
int error;
struct rb_node ** rb_link, * rb_parent;
unsigned long charged = 0;
+ long acctdom = VM_AD_DEFAULT;

if (file) {
+ if (is_file_hugepages(file))
+ acctdom = VM_AD_HUGETLB;
if (!file->f_op || !file->f_op->mmap)
return -ENODEV;

@@ -608,7 +611,8 @@ munmap_back:
> current->rlim[RLIMIT_AS].rlim_cur)
return -ENOMEM;

- if (!(flags & MAP_NORESERVE) || sysctl_overcommit_memory > 1) {
+ if (acctdom == VM_AD_DEFAULT && (!(flags & MAP_NORESERVE) ||
+ sysctl_overcommit_memory > 1)) {
if (vm_flags & VM_SHARED) {
/* Check memory availability in shmem_file_setup? */
vm_flags |= VM_ACCOUNT;
@@ -617,7 +621,7 @@ munmap_back:
* Private writable mapping: check memory availability
*/
charged = len >> PAGE_SHIFT;
- if (security_vm_enough_memory(charged))
+ if (security_vm_enough_memory(acctdom, charged))
return -ENOMEM;
vm_flags |= VM_ACCOUNT;
}
@@ -926,8 +930,8 @@ int expand_stack(struct vm_area_struct *
spin_lock(&vma->vm_mm->page_table_lock);
grow = (address - vma->vm_end) >> PAGE_SHIFT;

- /* Overcommit.. */
- if (security_vm_enough_memory(grow)) {
+ /* Overcommit ... assume stack is in normal memory */
+ if (security_vm_enough_memory(VM_AD_DEFAULT, grow)) {
spin_unlock(&vma->vm_mm->page_table_lock);
return -ENOMEM;
}
@@ -980,8 +984,8 @@ int expand_stack(struct vm_area_struct *
spin_lock(&vma->vm_mm->page_table_lock);
grow = (vma->vm_start - address) >> PAGE_SHIFT;

- /* Overcommit.. */
- if (security_vm_enough_memory(grow)) {
+ /* Overcommit ... assume stack is in normal memory */
+ if (security_vm_enough_memory(VM_AD_DEFAULT, grow)) {
spin_unlock(&vma->vm_mm->page_table_lock);
return -ENOMEM;
}
@@ -1378,7 +1382,7 @@ unsigned long do_brk(unsigned long addr,
if (mm->map_count > MAX_MAP_COUNT)
return -ENOMEM;

- if (security_vm_enough_memory(len >> PAGE_SHIFT))
+ if (security_vm_enough_memory(VM_AD_DEFAULT, len >> PAGE_SHIFT))
return -ENOMEM;

flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
diff -upN reference/mm/mprotect.c current/mm/mprotect.c
--- reference/mm/mprotect.c 2004-03-25 15:03:28.000000000 +0000
+++ current/mm/mprotect.c 2004-03-25 15:03:32.000000000 +0000
@@ -173,9 +173,10 @@ mprotect_fixup(struct vm_area_struct *vm
* a MAP_NORESERVE private mapping to writable will now reserve.
*/
if (newflags & VM_WRITE) {
- if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED))) {
+ if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED)) &&
+ VM_ACCTDOM(vma) == VM_AD_DEFAULT) {
charged = (end - start) >> PAGE_SHIFT;
- if (security_vm_enough_memory(charged))
+ if (security_vm_enough_memory(VM_ACCTDOM(vma), charged))
return -ENOMEM;
newflags |= VM_ACCOUNT;
}
diff -upN reference/mm/mremap.c current/mm/mremap.c
--- reference/mm/mremap.c 2004-02-23 18:15:13.000000000 +0000
+++ current/mm/mremap.c 2004-03-25 15:03:32.000000000 +0000
@@ -400,10 +400,10 @@ unsigned long do_mremap(unsigned long ad

if (vma->vm_flags & VM_ACCOUNT) {
charged = (new_len - old_len) >> PAGE_SHIFT;
- if (security_vm_enough_memory(charged))
+ if (security_vm_enough_memory(VM_ACCTDOM(vma), charged))
goto out_nc;
}
-
+
/* old_len exactly to the end of the area..
* And we're not relocating the area.
*/
diff -upN reference/mm/shmem.c current/mm/shmem.c
--- reference/mm/shmem.c 2004-03-25 02:43:43.000000000 +0000
+++ current/mm/shmem.c 2004-03-25 15:03:32.000000000 +0000
@@ -526,7 +526,7 @@ static int shmem_notify_change(struct de
*/
change = VM_ACCT(attr->ia_size) - VM_ACCT(inode->i_size);
if (change > 0) {
- if (security_vm_enough_memory(change))
+ if (security_vm_enough_memory(VM_AD_DEFAULT, change))
return -ENOMEM;
} else if (attr->ia_size < inode->i_size) {
vm_unacct_memory(-change);
@@ -1187,7 +1187,8 @@ shmem_file_write(struct file *file, cons
maxpos = inode->i_size;
if (maxpos < pos + count) {
maxpos = pos + count;
- if (security_vm_enough_memory(VM_ACCT(maxpos) - VM_ACCT(inode->i_size))) {
+ if (security_vm_enough_memory(VM_AD_DEFAULT,
+ VM_ACCT(maxpos) - VM_ACCT(inode->i_size))) {
err = -ENOMEM;
goto out;
}
@@ -1551,7 +1552,7 @@ static int shmem_symlink(struct inode *d
memcpy(info, symname, len);
inode->i_op = &shmem_symlink_inline_operations;
} else {
- if (security_vm_enough_memory(VM_ACCT(1))) {
+ if (security_vm_enough_memory(VM_AD_DEFAULT, VM_ACCT(1))) {
iput(inode);
return -ENOMEM;
}
@@ -1947,7 +1948,8 @@ struct file *shmem_file_setup(char *name
if (size > SHMEM_MAX_BYTES)
return ERR_PTR(-EINVAL);

- if ((flags & VM_ACCOUNT) && security_vm_enough_memory(VM_ACCT(size)))
+ if ((flags & VM_ACCOUNT) && security_vm_enough_memory(VM_AD_DEFAULT,
+ VM_ACCT(size)))
return ERR_PTR(-ENOMEM);

error = -ENOMEM;
diff -upN reference/mm/swapfile.c current/mm/swapfile.c
--- reference/mm/swapfile.c 2004-03-25 02:43:43.000000000 +0000
+++ current/mm/swapfile.c 2004-03-25 15:03:32.000000000 +0000
@@ -1048,7 +1048,7 @@ asmlinkage long sys_swapoff(const char _
swap_list_unlock();
goto out_dput;
}
- if (!security_vm_enough_memory(p->pages))
+ if (!security_vm_enough_memory(VM_AD_DEFAULT, p->pages))
vm_unacct_memory(p->pages);
else {
err = -ENOMEM;
diff -upN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c 2004-03-25 02:43:44.000000000 +0000
+++ current/security/commoncap.c 2004-03-25 15:03:32.000000000 +0000
@@ -308,10 +308,16 @@ int cap_syslog (int type)
* Strict overcommit modes added 2002 Feb 26 by Alan Cox.
* Additional code 2002 Jul 20 by Robert Love.
*/
-int cap_vm_enough_memory(long pages)
+int cap_vm_enough_memory(int domain, long pages)
{
unsigned long free, allowed;

+ /* We only account for the default memory domain, assume overcommit
+ * for all others.
+ */
+ if (domain != VM_AD_DEFAULT)
+ return 0;
+
vm_acct_memory(pages);

/*
diff -upN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c 2004-03-25 02:43:44.000000000 +0000
+++ current/security/dummy.c 2004-03-25 15:03:32.000000000 +0000
@@ -109,10 +109,16 @@ static int dummy_syslog (int type)
* We currently support three overcommit policies, which are set via the
* vm.overcommit_memory sysctl. See Documentation/vm/overcommit-accounting
*/
-static int dummy_vm_enough_memory(long pages)
+static int dummy_vm_enough_memory(int domain, long pages)
{
unsigned long free, allowed;

+ /* We only account for the default memory domain, assume overcommit
+ * for all others.
+ */
+ if (domain != VM_AD_DEFAULT)
+ return 0;
+
vm_acct_memory(pages);

/*
diff -upN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c 2004-03-25 02:43:44.000000000 +0000
+++ current/security/selinux/hooks.c 2004-03-25 15:03:32.000000000 +0000
@@ -1496,12 +1496,18 @@ static int selinux_syslog(int type)
* Strict overcommit modes added 2002 Feb 26 by Alan Cox.
* Additional code 2002 Jul 20 by Robert Love.
*/
-static int selinux_vm_enough_memory(long pages)
+static int selinux_vm_enough_memory(int domain, long pages)
{
unsigned long free, allowed;
int rc;
struct task_security_struct *tsec = current->security;

+ /* We only account for the default memory domain, assume overcommit
+ * for all others.
+ */
+ if (domain != VM_AD_DEFAULT)
+ return 0;
+
vm_acct_memory(pages);

/*

2004-03-25 17:00:47

by Andy Whitcroft

Subject: [PATCH] [4/6] HUGETLB memory commitment

[070-mem_acctdom_hugetlb]

Convert hugetlb to accounting domains (core)

---
fs/hugetlbfs/inode.c | 45 ++++++++++++++++++++++++++++++++++++++-------
include/linux/hugetlb.h | 5 +++++
security/commoncap.c | 9 +++++++++
security/dummy.c | 9 +++++++++
security/selinux/hooks.c | 9 +++++++++
5 files changed, 70 insertions(+), 7 deletions(-)

diff -upN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c 2004-03-25 02:43:00.000000000 +0000
+++ current/fs/hugetlbfs/inode.c 2004-03-25 15:03:33.000000000 +0000
@@ -26,12 +26,15 @@
#include <linux/dnotify.h>
#include <linux/statfs.h>
#include <linux/security.h>
+#include <linux/mman.h>

#include <asm/uaccess.h>

/* some random number */
#define HUGETLBFS_MAGIC 0x958458f6

+#define VM_ACCT(size) (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
+
static struct super_operations hugetlbfs_ops;
static struct address_space_operations hugetlbfs_aops;
struct file_operations hugetlbfs_file_operations;
@@ -191,6 +194,7 @@ void truncate_hugepages(struct address_s
static void hugetlbfs_delete_inode(struct inode *inode)
{
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(inode->i_sb);
+ long change;

hlist_del_init(&inode->i_hash);
list_del_init(&inode->i_list);
@@ -198,6 +202,9 @@ static void hugetlbfs_delete_inode(struc
inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);

+ change = VM_ACCT(inode->i_size) - VM_ACCT(0);
+ if (change)
+ vm_unacct_memory(VM_AD_HUGETLB, change);
if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);

@@ -217,6 +224,7 @@ static void hugetlbfs_forget_inode(struc
{
struct super_block *super_block = inode->i_sb;
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(super_block);
+ long change;

if (hlist_unhashed(&inode->i_hash))
goto out_truncate;
@@ -239,6 +247,9 @@ out_truncate:
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
+ change = VM_ACCT(inode->i_size) - VM_ACCT(0);
+ if (change)
+ vm_unacct_memory(VM_AD_HUGETLB, change);
if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);

@@ -312,8 +323,10 @@ static int hugetlb_vmtruncate(struct ino
unsigned long pgoff;
struct address_space *mapping = inode->i_mapping;

+ /*
if (offset > inode->i_size)
return -EINVAL;
+ */

BUG_ON(offset & ~HPAGE_MASK);
pgoff = offset >> HPAGE_SHIFT;
@@ -334,6 +347,8 @@ static int hugetlbfs_setattr(struct dent
struct inode *inode = dentry->d_inode;
int error;
unsigned int ia_valid = attr->ia_valid;
+ long change = 0;
+ loff_t csize;

BUG_ON(!inode);

@@ -345,15 +360,27 @@ static int hugetlbfs_setattr(struct dent
if (error)
goto out;
if (ia_valid & ATTR_SIZE) {
+ csize = i_size_read(inode);
error = -EINVAL;
- if (!(attr->ia_size & ~HPAGE_MASK))
- error = hugetlb_vmtruncate(inode, attr->ia_size);
- if (error)
+ if (attr->ia_size & ~HPAGE_MASK)
+ goto out;
+ if (attr->ia_size > csize)
goto out;
+ change = VM_ACCT(csize) - VM_ACCT(attr->ia_size);
+ if (change)
+ vm_unacct_memory(VM_AD_HUGETLB, change);
+ /* XXX: here we commit to removing the mappings; should we do
+ * this before we attempt to write the inode, or after? What
+ * should we do if it fails?
+ */
+ hugetlb_vmtruncate(inode, attr->ia_size);
attr->ia_valid &= ~ATTR_SIZE;
}
error = inode_setattr(inode, attr);
out:
+ if (error && change)
+ vm_acct_memory(VM_AD_HUGETLB, change);
+
return error;
}

@@ -710,17 +737,19 @@ struct file *hugetlb_zero_setup(size_t s
if (!capable(CAP_IPC_LOCK))
return ERR_PTR(-EPERM);

- if (!is_hugepage_mem_enough(size))
+ if (security_vm_enough_memory(VM_AD_HUGETLB, VM_ACCT(size)))
return ERR_PTR(-ENOMEM);
-
+
root = hugetlbfs_vfsmount->mnt_root;
snprintf(buf, 16, "%lu", hugetlbfs_counter());
quick_string.name = buf;
quick_string.len = strlen(quick_string.name);
quick_string.hash = 0;
dentry = d_alloc(root, &quick_string);
- if (!dentry)
- return ERR_PTR(-ENOMEM);
+ if (!dentry) {
+ error = -ENOMEM;
+ goto out_committed;
+ }

error = -ENFILE;
file = get_empty_filp();
@@ -747,6 +776,8 @@ out_file:
put_filp(file);
out_dentry:
dput(dentry);
+out_committed:
+ vm_unacct_memory(VM_AD_HUGETLB, VM_ACCT(size));
return ERR_PTR(error);
}

diff -upN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h
--- reference/include/linux/hugetlb.h 2004-02-23 18:15:09.000000000 +0000
+++ current/include/linux/hugetlb.h 2004-03-25 15:03:33.000000000 +0000
@@ -19,6 +19,7 @@ int hugetlb_prefault(struct address_spac
void huge_page_release(struct page *);
int hugetlb_report_meminfo(char *);
int is_hugepage_mem_enough(size_t);
+unsigned long hugetlb_total_pages(void);
struct page *follow_huge_addr(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write);
struct vm_area_struct *hugepage_vma(struct mm_struct *mm,
@@ -48,6 +49,10 @@ static inline int is_vm_hugetlb_page(str
{
return 0;
}
+static inline unsigned long hugetlb_total_pages(void)
+{
+ return 0;
+}

#define follow_hugetlb_page(m,v,p,vs,a,b,i) ({ BUG(); 0; })
#define follow_huge_addr(mm, vma, addr, write) 0
diff -upN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c 2004-03-25 15:03:32.000000000 +0000
+++ current/security/commoncap.c 2004-03-25 15:03:33.000000000 +0000
@@ -22,6 +22,7 @@
#include <linux/netlink.h>
#include <linux/ptrace.h>
#include <linux/xattr.h>
+#include <linux/hugetlb.h>

int cap_capable (struct task_struct *tsk, int cap)
{
@@ -314,6 +315,13 @@ int cap_vm_enough_memory(int domain, lon

vm_acct_memory(domain, pages);

+ /* Check against the full complement of hugepages, no reserve. */
+ if (domain == VM_AD_HUGETLB) {
+ allowed = hugetlb_total_pages();
+
+ goto check;
+ }
+
/* We only account for the default memory domain, assume overcommit
* for all others.
*/
@@ -367,6 +375,7 @@ int cap_vm_enough_memory(int domain, lon
allowed = totalram_pages * sysctl_overcommit_ratio / 100;
allowed += total_swap_pages;

+check:
if (atomic_read(&vm_committed_space[domain]) < allowed)
return 0;

diff -upN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c 2004-03-25 15:03:32.000000000 +0000
+++ current/security/dummy.c 2004-03-25 15:03:33.000000000 +0000
@@ -25,6 +25,7 @@
#include <linux/netlink.h>
#include <net/sock.h>
#include <linux/xattr.h>
+#include <linux/hugetlb.h>

static int dummy_ptrace (struct task_struct *parent, struct task_struct *child)
{
@@ -115,6 +116,13 @@ static int dummy_vm_enough_memory(int do

vm_acct_memory(domain, pages);

+ /* Check against the full complement of hugepages, no reserve. */
+ if (domain == VM_AD_HUGETLB) {
+ allowed = hugetlb_total_pages();
+
+ goto check;
+ }
+
/* We only account for the default memory domain, assume overcommit
* for all others.
*/
@@ -155,6 +163,7 @@ static int dummy_vm_enough_memory(int do
allowed = totalram_pages * sysctl_overcommit_ratio / 100;
allowed += total_swap_pages;

+check:
if (atomic_read(&vm_committed_space[domain]) < allowed)
return 0;

diff -upN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c 2004-03-25 15:03:32.000000000 +0000
+++ current/security/selinux/hooks.c 2004-03-25 15:03:33.000000000 +0000
@@ -59,6 +59,7 @@
#include <net/af_unix.h> /* for Unix socket types */
#include <linux/parser.h>
#include <linux/nfs_mount.h>
+#include <linux/hugetlb.h>

#include "avc.h"
#include "objsec.h"
@@ -1504,6 +1505,13 @@ static int selinux_vm_enough_memory(int

vm_acct_memory(domain, pages);

+ /* Check against the full complement of hugepages, no reserve. */
+ if (domain == VM_AD_HUGETLB) {
+ allowed = hugetlb_total_pages();
+
+ goto check;
+ }
+
/* We only account for the default memory domain, assume overcommit
* for all others.
*/
@@ -1553,6 +1561,7 @@ static int selinux_vm_enough_memory(int
allowed = totalram_pages * sysctl_overcommit_ratio / 100;
allowed += total_swap_pages;

+check:
if (atomic_read(&vm_committed_space[domain]) < allowed)
return 0;


2004-03-25 17:03:38

by Andy Whitcroft

Subject: [PATCH] [3/6] HUGETLB memory commitment

[060-mem_acctdom_commitments]

Split vm_committed_space per domain

Currently only normal page commitments are tracked. This patch
provides a framework for tracking page commitments in multiple
independent domains. With this patch vm_committed_space becomes
a per-domain array of counters.
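
A sketch of the resulting per-domain interfaces (as used in the
hunks below):

	vm_acct_memory(VM_AD_HUGETLB, pages);	/* charge the domain */
	vm_unacct_memory(VM_AD_HUGETLB, pages);	/* release it again */

	madv_t charge = MADV_NONE;	/* one counter per domain */
	madv_add(&charge, VM_ACCTDOM(vma), len);
	vm_unacct_memory_domains(&charge); /* release all domains */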

---
fs/proc/proc_misc.c | 2 +-
include/linux/mm.h | 13 +++++++++++--
include/linux/mman.h | 12 ++++++------
kernel/fork.c | 8 +++-----
mm/memory.c | 12 +++++++++---
mm/mmap.c | 23 ++++++++++++-----------
mm/mprotect.c | 5 ++---
mm/mremap.c | 12 ++++++------
mm/nommu.c | 3 ++-
mm/shmem.c | 13 +++++++------
mm/swap.c | 17 +++++++++++++----
mm/swapfile.c | 4 +++-
security/commoncap.c | 10 +++++-----
security/dummy.c | 10 +++++-----
security/selinux/hooks.c | 10 +++++-----
15 files changed, 90 insertions(+), 64 deletions(-)

diff -upN reference/fs/proc/proc_misc.c current/fs/proc/proc_misc.c
--- reference/fs/proc/proc_misc.c 2004-03-25 15:03:28.000000000 +0000
+++ current/fs/proc/proc_misc.c 2004-03-25 15:03:32.000000000 +0000
@@ -174,7 +174,7 @@ static int meminfo_read_proc(char *page,
#define K(x) ((x) << (PAGE_SHIFT - 10))
si_meminfo(&i);
si_swapinfo(&i);
- committed = atomic_read(&vm_committed_space);
+ committed = atomic_read(&vm_committed_space[VM_AD_DEFAULT]);

vmtot = (VMALLOC_END-VMALLOC_START)>>10;
vmi = get_vmalloc_info();
diff -upN reference/include/linux/mman.h current/include/linux/mman.h
--- reference/include/linux/mman.h 2004-01-09 06:59:09.000000000 +0000
+++ current/include/linux/mman.h 2004-03-25 15:03:32.000000000 +0000
@@ -12,20 +12,20 @@

extern int sysctl_overcommit_memory;
extern int sysctl_overcommit_ratio;
-extern atomic_t vm_committed_space;
+extern atomic_t vm_committed_space[];

#ifdef CONFIG_SMP
-extern void vm_acct_memory(long pages);
+extern void vm_acct_memory(int domain, long pages);
#else
-static inline void vm_acct_memory(long pages)
+static inline void vm_acct_memory(int domain, long pages)
{
- atomic_add(pages, &vm_committed_space);
+ atomic_add(pages, &vm_committed_space[domain]);
}
#endif

-static inline void vm_unacct_memory(long pages)
+static inline void vm_unacct_memory(int domain, long pages)
{
- vm_acct_memory(-pages);
+ vm_acct_memory(domain, -pages);
}

/*
diff -upN reference/include/linux/mm.h current/include/linux/mm.h
--- reference/include/linux/mm.h 2004-03-25 15:03:32.000000000 +0000
+++ current/include/linux/mm.h 2004-03-25 15:03:32.000000000 +0000
@@ -117,7 +117,16 @@ struct vm_area_struct {
#define VM_ACCTDOM(vma) (!!((vma)->vm_flags & VM_HUGETLB))
#define VM_AD_DEFAULT 0
#define VM_AD_HUGETLB 1
-
+typedef struct {
+ long vec[VM_ACCTDOM_NR];
+} madv_t;
+#define MADV_NONE { {[0 ... VM_ACCTDOM_NR-1] = 0UL} }
+static inline void madv_add(madv_t *madv, int domain, long size)
+{
+ madv->vec[domain] += size;
+}
+void vm_unacct_memory_domains(madv_t *madv);
+
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
#endif
@@ -446,7 +455,7 @@ void zap_page_range(struct vm_area_struc
unsigned long size);
int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
struct vm_area_struct *start_vma, unsigned long start_addr,
- unsigned long end_addr, unsigned long *nr_accounted);
+ unsigned long end_addr, madv_t *nr_accounted);
void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long address, unsigned long size);
void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr);
diff -upN reference/kernel/fork.c current/kernel/fork.c
--- reference/kernel/fork.c 2004-03-25 15:03:32.000000000 +0000
+++ current/kernel/fork.c 2004-03-25 15:03:32.000000000 +0000
@@ -267,7 +267,7 @@ static inline int dup_mmap(struct mm_str
struct vm_area_struct * mpnt, *tmp, **pprev;
struct rb_node **rb_link, *rb_parent;
int retval;
- unsigned long charge = 0;
+ madv_t charge = MADV_NONE;

down_write(&oldmm->mmap_sem);
flush_cache_mm(current->mm);
@@ -303,8 +303,7 @@ static inline int dup_mmap(struct mm_str
unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
if (security_vm_enough_memory(VM_ACCTDOM(mpnt), len))
goto fail_nomem;
- if (VM_ACCTDOM(mpnt) == VM_AD_DEFAULT)
- charge += len;
+ madv_add(&charge, VM_ACCTDOM(mpnt), len);
}
tmp = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
if (!tmp)
@@ -359,8 +358,7 @@ out:
fail_nomem:
retval = -ENOMEM;
fail:
- if (charge)
- vm_unacct_memory(charge);
+ vm_unacct_memory_domains(&charge);
goto out;
}
static inline int mm_alloc_pgd(struct mm_struct * mm)
diff -upN reference/mm/memory.c current/mm/memory.c
--- reference/mm/memory.c 2004-03-25 15:03:32.000000000 +0000
+++ current/mm/memory.c 2004-03-25 15:03:32.000000000 +0000
@@ -524,7 +524,7 @@ void unmap_page_range(struct mmu_gather
*/
int unmap_vmas(struct mmu_gather **tlbp, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long start_addr,
- unsigned long end_addr, unsigned long *nr_accounted)
+ unsigned long end_addr, madv_t *nr_accounted)
{
unsigned long zap_bytes = ZAP_BLOCK_SIZE;
unsigned long tlb_start = 0; /* For tlb_finish_mmu */
@@ -553,7 +553,8 @@ int unmap_vmas(struct mmu_gather **tlbp,

/* We assume that only accountable VMAs are VM_ACCOUNT. */
if (vma->vm_flags & VM_ACCOUNT)
- *nr_accounted += (end - start) >> PAGE_SHIFT;
+ madv_add(nr_accounted,
+ VM_ACCTDOM(vma), (end - start) >> PAGE_SHIFT);

ret++;
while (start != end) {
@@ -602,7 +603,12 @@ void zap_page_range(struct vm_area_struc
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather *tlb;
unsigned long end = address + size;
- unsigned long nr_accounted = 0;
+ madv_t nr_accounted = MADV_NONE;
+
+ /* XXX: we seem to avoid thinking about the memory accounting
+ * both for hugepages, where we don't bother even tracking it, and
+ * for the normal path, where we figure it out and do nothing with it??
+ */

might_sleep();

diff -upN reference/mm/mmap.c current/mm/mmap.c
--- reference/mm/mmap.c 2004-03-25 15:03:32.000000000 +0000
+++ current/mm/mmap.c 2004-03-25 15:03:32.000000000 +0000
@@ -54,7 +54,8 @@ pgprot_t protection_map[16] = {

int sysctl_overcommit_memory = 0; /* default is heuristic overcommit */
int sysctl_overcommit_ratio = 50; /* default is 50% */
-atomic_t vm_committed_space = ATOMIC_INIT(0);
+atomic_t vm_committed_space[VM_ACCTDOM_NR] =
+ { [ 0 ... VM_ACCTDOM_NR-1 ] = ATOMIC_INIT(0) };

EXPORT_SYMBOL(sysctl_overcommit_memory);
EXPORT_SYMBOL(sysctl_overcommit_ratio);
@@ -611,8 +612,8 @@ munmap_back:
> current->rlim[RLIMIT_AS].rlim_cur)
return -ENOMEM;

- if (acctdom == VM_AD_DEFAULT && (!(flags & MAP_NORESERVE) ||
- sysctl_overcommit_memory > 1)) {
+ if (!(flags & MAP_NORESERVE) ||
+ (acctdom == VM_AD_DEFAULT && sysctl_overcommit_memory > 1)) {
if (vm_flags & VM_SHARED) {
/* Check memory availability in shmem_file_setup? */
vm_flags |= VM_ACCOUNT;
@@ -730,7 +731,7 @@ free_vma:
kmem_cache_free(vm_area_cachep, vma);
unacct_error:
if (charged)
- vm_unacct_memory(charged);
+ vm_unacct_memory(acctdom, charged);
return error;
}

@@ -940,7 +941,7 @@ int expand_stack(struct vm_area_struct *
((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
current->rlim[RLIMIT_AS].rlim_cur) {
spin_unlock(&vma->vm_mm->page_table_lock);
- vm_unacct_memory(grow);
+ vm_unacct_memory(VM_AD_DEFAULT, grow);
return -ENOMEM;
}
vma->vm_end = address;
@@ -994,7 +995,7 @@ int expand_stack(struct vm_area_struct *
((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) >
current->rlim[RLIMIT_AS].rlim_cur) {
spin_unlock(&vma->vm_mm->page_table_lock);
- vm_unacct_memory(grow);
+ vm_unacct_memory(VM_AD_DEFAULT, grow);
return -ENOMEM;
}
vma->vm_start = address;
@@ -1152,12 +1153,12 @@ static void unmap_region(struct mm_struc
unsigned long end)
{
struct mmu_gather *tlb;
- unsigned long nr_accounted = 0;
+ madv_t nr_accounted = MADV_NONE;

lru_add_drain();
tlb = tlb_gather_mmu(mm, 0);
unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted);
- vm_unacct_memory(nr_accounted);
+ vm_unacct_memory_domains(&nr_accounted);

if (is_hugepage_only_range(start, end - start))
hugetlb_free_pgtables(tlb, prev, start, end);
@@ -1397,7 +1398,7 @@ unsigned long do_brk(unsigned long addr,
*/
vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
if (!vma) {
- vm_unacct_memory(len >> PAGE_SHIFT);
+ vm_unacct_memory(VM_AD_DEFAULT, len >> PAGE_SHIFT);
return -ENOMEM;
}

@@ -1430,7 +1431,7 @@ void exit_mmap(struct mm_struct *mm)
{
struct mmu_gather *tlb;
struct vm_area_struct *vma;
- unsigned long nr_accounted = 0;
+ madv_t nr_accounted = MADV_NONE;

profile_exit_mmap(mm);

@@ -1443,7 +1444,7 @@ void exit_mmap(struct mm_struct *mm)
/* Use ~0UL here to ensure all VMAs in the mm are unmapped */
mm->map_count -= unmap_vmas(&tlb, mm, mm->mmap, 0,
~0UL, &nr_accounted);
- vm_unacct_memory(nr_accounted);
+ vm_unacct_memory_domains(&nr_accounted);
BUG_ON(mm->map_count); /* This is just debugging */
clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
diff -upN reference/mm/mprotect.c current/mm/mprotect.c
--- reference/mm/mprotect.c 2004-03-25 15:03:32.000000000 +0000
+++ current/mm/mprotect.c 2004-03-25 15:03:32.000000000 +0000
@@ -173,8 +173,7 @@ mprotect_fixup(struct vm_area_struct *vm
* a MAP_NORESERVE private mapping to writable will now reserve.
*/
if (newflags & VM_WRITE) {
- if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED)) &&
- VM_ACCTDOM(vma) == VM_AD_DEFAULT) {
+ if (!(vma->vm_flags & (VM_ACCOUNT|VM_WRITE|VM_SHARED))) {
charged = (end - start) >> PAGE_SHIFT;
if (security_vm_enough_memory(VM_ACCTDOM(vma), charged))
return -ENOMEM;
@@ -218,7 +217,7 @@ success:
return 0;

fail:
- vm_unacct_memory(charged);
+ vm_unacct_memory(VM_ACCTDOM(vma), charged);
return error;
}

diff -upN reference/mm/mremap.c current/mm/mremap.c
--- reference/mm/mremap.c 2004-03-25 15:03:32.000000000 +0000
+++ current/mm/mremap.c 2004-03-25 15:03:32.000000000 +0000
@@ -401,7 +401,7 @@ unsigned long do_mremap(unsigned long ad
if (vma->vm_flags & VM_ACCOUNT) {
charged = (new_len - old_len) >> PAGE_SHIFT;
if (security_vm_enough_memory(VM_ACCTDOM(vma), charged))
- goto out_nc;
+ goto out;
}

/* old_len exactly to the end of the area..
@@ -426,7 +426,7 @@ unsigned long do_mremap(unsigned long ad
addr + new_len);
}
ret = addr;
- goto out;
+ goto out_commited;
}
}

@@ -445,14 +445,14 @@ unsigned long do_mremap(unsigned long ad
vma->vm_pgoff, map_flags);
ret = new_addr;
if (new_addr & ~PAGE_MASK)
- goto out;
+ goto out_commited;
}
ret = move_vma(vma, addr, old_len, new_len, new_addr);
}
-out:
+out_commited:
if (ret & ~PAGE_MASK)
- vm_unacct_memory(charged);
-out_nc:
+ vm_unacct_memory(VM_ACCTDOM(vma), charged);
+out:
return ret;
}

diff -upN reference/mm/nommu.c current/mm/nommu.c
--- reference/mm/nommu.c 2004-02-04 15:09:16.000000000 +0000
+++ current/mm/nommu.c 2004-03-25 15:03:32.000000000 +0000
@@ -29,7 +29,8 @@ struct page *mem_map;
unsigned long max_mapnr;
unsigned long num_physpages;
unsigned long askedalloc, realalloc;
-atomic_t vm_committed_space = ATOMIC_INIT(0);
+atomic_t vm_committed_space[VM_ACCTDOM_NR] =
+ { [ 0 ... VM_ACCTDOM_NR-1 ] = ATOMIC_INIT(0) };
int sysctl_overcommit_memory; /* default is heuristic overcommit */
int sysctl_overcommit_ratio = 50; /* default is 50% */

diff -upN reference/mm/shmem.c current/mm/shmem.c
--- reference/mm/shmem.c 2004-03-25 15:03:32.000000000 +0000
+++ current/mm/shmem.c 2004-03-25 15:03:32.000000000 +0000
@@ -529,7 +529,7 @@ static int shmem_notify_change(struct de
if (security_vm_enough_memory(VM_AD_DEFAULT, change))
return -ENOMEM;
} else if (attr->ia_size < inode->i_size) {
- vm_unacct_memory(-change);
+ vm_unacct_memory(VM_AD_DEFAULT, -change);
/*
* If truncating down to a partial page, then
* if that page is already allocated, hold it
@@ -564,7 +564,7 @@ static int shmem_notify_change(struct de
if (page)
page_cache_release(page);
if (error)
- vm_unacct_memory(change);
+ vm_unacct_memory(VM_AD_DEFAULT, change);
return error;
}

@@ -578,7 +578,7 @@ static void shmem_delete_inode(struct in
list_del(&info->list);
spin_unlock(&shmem_ilock);
if (info->flags & VM_ACCOUNT)
- vm_unacct_memory(VM_ACCT(inode->i_size));
+ vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(inode->i_size));
inode->i_size = 0;
shmem_truncate(inode);
}
@@ -1271,7 +1271,8 @@ shmem_file_write(struct file *file, cons

/* Short writes give back address space */
if (inode->i_size != maxpos)
- vm_unacct_memory(VM_ACCT(maxpos) - VM_ACCT(inode->i_size));
+ vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(maxpos) -
+ VM_ACCT(inode->i_size));
out:
up(&inode->i_sem);
return err;
@@ -1558,7 +1559,7 @@ static int shmem_symlink(struct inode *d
}
error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL);
if (error) {
- vm_unacct_memory(VM_ACCT(1));
+ vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(1));
iput(inode);
return error;
}
@@ -1988,7 +1989,7 @@ put_dentry:
dput(dentry);
put_memory:
if (flags & VM_ACCOUNT)
- vm_unacct_memory(VM_ACCT(size));
+ vm_unacct_memory(VM_AD_DEFAULT, VM_ACCT(size));
return ERR_PTR(error);
}

diff -upN reference/mm/swap.c current/mm/swap.c
--- reference/mm/swap.c 2004-03-25 02:43:43.000000000 +0000
+++ current/mm/swap.c 2004-03-25 15:03:32.000000000 +0000
@@ -368,17 +368,18 @@ unsigned int pagevec_lookup(struct pagev
*/
#define ACCT_THRESHOLD max(16, NR_CPUS * 2)

-static DEFINE_PER_CPU(long, committed_space) = 0;
+/* XXX: zero this????? */
+static DEFINE_PER_CPU(long, committed_space[VM_ACCTDOM_NR]);

-void vm_acct_memory(long pages)
+void vm_acct_memory(int domain, long pages)
{
long *local;

preempt_disable();
- local = &__get_cpu_var(committed_space);
+ local = &__get_cpu_var(committed_space[domain]);
*local += pages;
if (*local > ACCT_THRESHOLD || *local < -ACCT_THRESHOLD) {
- atomic_add(*local, &vm_committed_space);
+ atomic_add(*local, &vm_committed_space[domain]);
*local = 0;
}
preempt_enable();
@@ -416,6 +417,14 @@ static int cpu_swap_callback(struct noti
#endif /* CONFIG_HOTPLUG_CPU */
#endif /* CONFIG_SMP */

+void vm_unacct_memory_domains(madv_t *adv)
+{
+ if (adv->vec[0])
+ vm_unacct_memory(VM_AD_DEFAULT, adv->vec[0]);
+ if (adv->vec[1])
+ vm_unacct_memory(VM_AD_HUGETLB, adv->vec[1]);
+}
+
#ifdef CONFIG_SMP
void percpu_counter_mod(struct percpu_counter *fbc, long amount)
{
diff -upN reference/mm/swapfile.c current/mm/swapfile.c
--- reference/mm/swapfile.c 2004-03-25 15:03:32.000000000 +0000
+++ current/mm/swapfile.c 2004-03-25 15:03:32.000000000 +0000
@@ -1048,8 +1048,10 @@ asmlinkage long sys_swapoff(const char _
swap_list_unlock();
goto out_dput;
}
+ /* There is an assumption here that we may only have swapped things
+ * from the default memory accounting domain to this device. */
if (!security_vm_enough_memory(VM_AD_DEFAULT, p->pages))
- vm_unacct_memory(p->pages);
+ vm_unacct_memory(VM_AD_DEFAULT, p->pages);
else {
err = -ENOMEM;
swap_list_unlock();
diff -upN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c 2004-03-25 15:03:32.000000000 +0000
+++ current/security/commoncap.c 2004-03-25 15:03:32.000000000 +0000
@@ -312,14 +312,14 @@ int cap_vm_enough_memory(int domain, lon
{
unsigned long free, allowed;

+ vm_acct_memory(domain, pages);
+
/* We only account for the default memory domain, assume overcommit
* for all others.
*/
if (domain != VM_AD_DEFAULT)
return 0;

- vm_acct_memory(pages);
-
/*
* Sometimes we want to use more memory than we have
*/
@@ -360,17 +360,17 @@ int cap_vm_enough_memory(int domain, lon

if (free > pages)
return 0;
- vm_unacct_memory(pages);
+ vm_unacct_memory(domain, pages);
return -ENOMEM;
}

allowed = totalram_pages * sysctl_overcommit_ratio / 100;
allowed += total_swap_pages;

- if (atomic_read(&vm_committed_space) < allowed)
+ if (atomic_read(&vm_committed_space[domain]) < allowed)
return 0;

- vm_unacct_memory(pages);
+ vm_unacct_memory(domain, pages);

return -ENOMEM;
}
diff -upN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c 2004-03-25 15:03:32.000000000 +0000
+++ current/security/dummy.c 2004-03-25 15:03:32.000000000 +0000
@@ -113,14 +113,14 @@ static int dummy_vm_enough_memory(int do
{
unsigned long free, allowed;

+ vm_acct_memory(domain, pages);
+
/* We only account for the default memory domain, assume overcommit
* for all others.
*/
if (domain != VM_AD_DEFAULT)
return 0;

- vm_acct_memory(pages);
-
/*
* Sometimes we want to use more memory than we have
*/
@@ -148,17 +148,17 @@ static int dummy_vm_enough_memory(int do

if (free > pages)
return 0;
- vm_unacct_memory(pages);
+ vm_unacct_memory(domain, pages);
return -ENOMEM;
}

allowed = totalram_pages * sysctl_overcommit_ratio / 100;
allowed += total_swap_pages;

- if (atomic_read(&vm_committed_space) < allowed)
+ if (atomic_read(&vm_committed_space[domain]) < allowed)
return 0;

- vm_unacct_memory(pages);
+ vm_unacct_memory(domain, pages);

return -ENOMEM;
}
diff -upN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c 2004-03-25 15:03:32.000000000 +0000
+++ current/security/selinux/hooks.c 2004-03-25 15:03:32.000000000 +0000
@@ -1502,14 +1502,14 @@ static int selinux_vm_enough_memory(int
int rc;
struct task_security_struct *tsec = current->security;

+ vm_acct_memory(domain, pages);
+
/* We only account for the default memory domain, assume overcommit
* for all others.
*/
if (domain != VM_AD_DEFAULT)
return 0;

- vm_acct_memory(pages);
-
/*
* Sometimes we want to use more memory than we have
*/
@@ -1546,17 +1546,17 @@ static int selinux_vm_enough_memory(int

if (free > pages)
return 0;
- vm_unacct_memory(pages);
+ vm_unacct_memory(domain, pages);
return -ENOMEM;
}

allowed = totalram_pages * sysctl_overcommit_ratio / 100;
allowed += total_swap_pages;

- if (atomic_read(&vm_committed_space) < allowed)
+ if (atomic_read(&vm_committed_space[domain]) < allowed)
return 0;

- vm_unacct_memory(pages);
+ vm_unacct_memory(domain, pages);

return -ENOMEM;
}

2004-03-25 17:08:07

by Andy Whitcroft

Subject: [PATCH] [5/6] HUGETLB memory commitment

[075-mem_acctdom_hugetlb_arch]

Convert hugetlb to accounting domains (arch)

---
i386/mm/hugetlbpage.c | 16 +++++++++++++---
ia64/mm/hugetlbpage.c | 16 +++++++++++++---
ppc64/mm/hugetlbpage.c | 16 +++++++++++++---
sparc64/mm/hugetlbpage.c | 16 +++++++++++++---
4 files changed, 52 insertions(+), 12 deletions(-)

diff -upN reference/arch/i386/mm/hugetlbpage.c current/arch/i386/mm/hugetlbpage.c
--- reference/arch/i386/mm/hugetlbpage.c 2004-01-09 07:00:02.000000000 +0000
+++ current/arch/i386/mm/hugetlbpage.c 2004-03-25 15:03:27.000000000 +0000
@@ -15,7 +15,7 @@
#include <linux/module.h>
#include <linux/err.h>
#include <linux/sysctl.h>
-#include <asm/mman.h>
+#include <linux/mman.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
@@ -513,13 +513,17 @@ module_init(hugetlb_init);

int hugetlb_report_meminfo(char *buf)
{
+ int committed = atomic_read(&vm_committed_space[VM_AD_HUGETLB]);
+#define K(x) ((x) << (PAGE_SHIFT - 10))
return sprintf(buf,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
- "Hugepagesize: %5lu kB\n",
+ "Hugepagesize: %5lu kB\n"
+ "HugeCommited_AS: %8u kB\n",
htlbzone_pages,
htlbpagemem,
- HPAGE_SIZE/1024);
+ HPAGE_SIZE/1024,
+ K(committed));
}

int is_hugepage_mem_enough(size_t size)
@@ -527,6 +531,12 @@ int is_hugepage_mem_enough(size_t size)
return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpagemem;
}

+/* Return the number of pages of memory we physically have, in PAGE_SIZE units. */
+unsigned long hugetlb_total_pages(void)
+{
+ return htlbzone_pages * (HPAGE_SIZE / PAGE_SIZE);
+}
+
/*
* We cannot handle pagefaults against hugetlb pages at all. They cause
* handle_mm_fault() to try to instantiate regular-sized pages in the
diff -upN reference/arch/ia64/mm/hugetlbpage.c current/arch/ia64/mm/hugetlbpage.c
--- reference/arch/ia64/mm/hugetlbpage.c 2004-03-11 20:47:12.000000000 +0000
+++ current/arch/ia64/mm/hugetlbpage.c 2004-03-25 15:03:27.000000000 +0000
@@ -17,7 +17,7 @@
#include <linux/smp_lock.h>
#include <linux/slab.h>
#include <linux/sysctl.h>
-#include <asm/mman.h>
+#include <linux/mman.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
@@ -576,13 +576,17 @@ __initcall(hugetlb_init);

int hugetlb_report_meminfo(char *buf)
{
+ int committed = atomic_read(&vm_committed_space[VM_AD_HUGETLB]);
+#define K(x) ((x) << (PAGE_SHIFT - 10))
return sprintf(buf,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
- "Hugepagesize: %5lu kB\n",
+ "Hugepagesize: %5lu kB\n"
+ "HugeCommited_AS: %8u kB\n",
htlbzone_pages,
htlbpagemem,
- HPAGE_SIZE/1024);
+ HPAGE_SIZE/1024,
+ K(committed));
}

int is_hugepage_mem_enough(size_t size)
@@ -592,6 +596,12 @@ int is_hugepage_mem_enough(size_t size)
return 1;
}

+/* Return the number of pages of memory we physically have, in PAGE_SIZE units. */
+unsigned long hugetlb_total_pages(void)
+{
+ return htlbzone_pages * (HPAGE_SIZE / PAGE_SIZE);
+}
+
static struct page *hugetlb_nopage(struct vm_area_struct * area, unsigned long address, int *unused)
{
BUG();
diff -upN reference/arch/ppc64/mm/hugetlbpage.c current/arch/ppc64/mm/hugetlbpage.c
--- reference/arch/ppc64/mm/hugetlbpage.c 2004-03-11 20:47:14.000000000 +0000
+++ current/arch/ppc64/mm/hugetlbpage.c 2004-03-25 15:03:27.000000000 +0000
@@ -17,7 +17,7 @@
#include <linux/module.h>
#include <linux/err.h>
#include <linux/sysctl.h>
-#include <asm/mman.h>
+#include <linux/mman.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
@@ -896,13 +896,17 @@ module_init(hugetlb_init);

int hugetlb_report_meminfo(char *buf)
{
+ int committed = atomic_read(&vm_committed_space[VM_AD_HUGETLB]);
+#define K(x) ((x) << (PAGE_SHIFT - 10))
return sprintf(buf,
"HugePages_Total: %5d\n"
"HugePages_Free: %5d\n"
- "Hugepagesize: %5lu kB\n",
+ "Hugepagesize: %5lu kB\n"
+ "HugeCommited_AS: %8u kB",
htlbpage_total,
htlbpage_free,
- HPAGE_SIZE/1024);
+ HPAGE_SIZE/1024,
+ K(committed));
}

/* This is advisory only, so we can get away with accesing
@@ -912,6 +916,12 @@ int is_hugepage_mem_enough(size_t size)
return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpage_free;
}

+/* Return the number of pages of memory we physically have, in PAGE_SIZE units. */
+unsigned long hugetlb_total_pages(void)
+{
+ return htlbpage_total * (HPAGE_SIZE / PAGE_SIZE);
+}
+
/*
* We cannot handle pagefaults against hugetlb pages at all. They cause
* handle_mm_fault() to try to instantiate regular-sized pages in the
diff -upN reference/arch/sparc64/mm/hugetlbpage.c current/arch/sparc64/mm/hugetlbpage.c
--- reference/arch/sparc64/mm/hugetlbpage.c 2004-01-09 06:59:45.000000000 +0000
+++ current/arch/sparc64/mm/hugetlbpage.c 2004-03-25 15:03:27.000000000 +0000
@@ -13,8 +13,8 @@
#include <linux/smp_lock.h>
#include <linux/slab.h>
#include <linux/sysctl.h>
+#include <linux/mman.h>

-#include <asm/mman.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
@@ -483,13 +483,17 @@ module_init(hugetlb_init);

int hugetlb_report_meminfo(char *buf)
{
+ int committed = atomic_read(&vm_committed_space[VM_AD_HUGETLB]);
+#define K(x) ((x) << (PAGE_SHIFT - 10))
return sprintf(buf,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
- "Hugepagesize: %5lu kB\n",
+ "Hugepagesize: %5lu kB\n"
+ "HugeCommited_AS: %8u kB\n",
htlbzone_pages,
htlbpagemem,
- HPAGE_SIZE/1024);
+ HPAGE_SIZE/1024,
+ K(committed));
}

int is_hugepage_mem_enough(size_t size)
@@ -497,6 +501,12 @@ int is_hugepage_mem_enough(size_t size)
return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpagemem;
}

+/* Return the number of pages of memory we physically have, in PAGE_SIZE units. */
+unsigned long hugetlb_total_pages(void)
+{
+ return htlbzone_pages * (HPAGE_SIZE / PAGE_SIZE);
+}
+
/*
* We cannot handle pagefaults against hugetlb pages at all. They cause
* handle_mm_fault() to try to instantiate regular-sized pages in the

2004-03-25 17:09:16

by Andy Whitcroft

Subject: [PATCH] [6/6] HUGETLB memory commitment

[080-mem_acctdom_hugetlb_sysctl]

---
include/linux/mman.h | 4 ++--
include/linux/sysctl.h | 2 ++
kernel/sysctl.c | 28 ++++++++++++++++++++++------
mm/mmap.c | 11 +++++++----
mm/nommu.c | 8 ++++++--
security/commoncap.c | 19 ++++++++++---------
security/dummy.c | 19 ++++++++++---------
security/selinux/hooks.c | 19 ++++++++++---------
8 files changed, 69 insertions(+), 41 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/mman.h current/include/linux/mman.h
--- reference/include/linux/mman.h 2004-03-25 15:03:32.000000000 +0000
+++ current/include/linux/mman.h 2004-03-25 16:43:46.000000000 +0000
@@ -10,8 +10,8 @@
#define MREMAP_MAYMOVE 1
#define MREMAP_FIXED 2

-extern int sysctl_overcommit_memory;
-extern int sysctl_overcommit_ratio;
+extern int sysctl_overcommit_memory[];
+extern int sysctl_overcommit_ratio[];
extern atomic_t vm_committed_space[];

#ifdef CONFIG_SMP
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/sysctl.h current/include/linux/sysctl.h
--- reference/include/linux/sysctl.h 2004-03-11 20:47:28.000000000 +0000
+++ current/include/linux/sysctl.h 2004-03-25 16:45:06.000000000 +0000
@@ -158,6 +158,8 @@ enum
VM_SWAPPINESS=19, /* Tendency to steal mapped memory */
VM_LOWER_ZONE_PROTECTION=20,/* Amount of protection of lower zones */
VM_MIN_FREE_KBYTES=21, /* Minimum free kilobytes to maintain */
+ VM_OVERCOMMIT_MEMORY_HUGEPAGES=22, /* Turn off the virtual memory safety limit */
+ VM_OVERCOMMIT_RATIO_HUGEPAGES=23, /* percent of RAM to allow overcommit in */
};


diff -X /home/apw/lib/vdiff.excl -rupN reference/kernel/sysctl.c current/kernel/sysctl.c
--- reference/kernel/sysctl.c 2004-03-25 15:03:28.000000000 +0000
+++ current/kernel/sysctl.c 2004-03-25 16:44:46.000000000 +0000
@@ -50,8 +50,8 @@
/* External variables not in a header file. */
extern int panic_timeout;
extern int C_A_D;
-extern int sysctl_overcommit_memory;
-extern int sysctl_overcommit_ratio;
+extern int sysctl_overcommit_memory[];
+extern int sysctl_overcommit_ratio[];
extern int max_threads;
extern atomic_t nr_queued_signals;
extern int max_queued_signals;
@@ -628,16 +628,16 @@ static ctl_table vm_table[] = {
{
.ctl_name = VM_OVERCOMMIT_MEMORY,
.procname = "overcommit_memory",
- .data = &sysctl_overcommit_memory,
- .maxlen = sizeof(sysctl_overcommit_memory),
+ .data = &sysctl_overcommit_memory[VM_AD_DEFAULT],
+ .maxlen = sizeof(sysctl_overcommit_memory[VM_AD_DEFAULT]),
.mode = 0644,
.proc_handler = &proc_dointvec,
},
{
.ctl_name = VM_OVERCOMMIT_RATIO,
.procname = "overcommit_ratio",
- .data = &sysctl_overcommit_ratio,
- .maxlen = sizeof(sysctl_overcommit_ratio),
+ .data = &sysctl_overcommit_ratio[VM_AD_DEFAULT],
+ .maxlen = sizeof(sysctl_overcommit_ratio[VM_AD_DEFAULT]),
.mode = 0644,
.proc_handler = &proc_dointvec,
},
@@ -715,6 +715,22 @@ static ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = &hugetlb_sysctl_handler,
},
+ {
+ .ctl_name = VM_OVERCOMMIT_MEMORY_HUGEPAGES,
+ .procname = "overcommit_memory_hugepages",
+ .data = &sysctl_overcommit_memory[VM_AD_HUGETLB],
+ .maxlen = sizeof(sysctl_overcommit_memory[VM_AD_HUGETLB]),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = VM_OVERCOMMIT_RATIO_HUGEPAGES,
+ .procname = "overcommit_ratio_hugepages",
+ .data = &sysctl_overcommit_ratio[VM_AD_HUGETLB],
+ .maxlen = sizeof(sysctl_overcommit_ratio[VM_AD_HUGETLB]),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#endif
{
.ctl_name = VM_LOWER_ZONE_PROTECTION,
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/mmap.c current/mm/mmap.c
--- reference/mm/mmap.c 2004-03-25 15:03:32.000000000 +0000
+++ current/mm/mmap.c 2004-03-25 17:23:45.000000000 +0000
@@ -52,8 +52,12 @@ pgprot_t protection_map[16] = {
__S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111
};

-int sysctl_overcommit_memory = 0; /* default is heuristic overcommit */
-int sysctl_overcommit_ratio = 50; /* default is 50% */
+/* Defaults are:
+ * VM_AD_DEFAULT heuristic overcommit, ratio 50%
+ * VM_AD_HUGETLB strict commit, ratio 100%
+ */
+int sysctl_overcommit_memory[VM_ACCTDOM_NR] = { 0, 0 };
+int sysctl_overcommit_ratio[VM_ACCTDOM_NR] = { 50, 100 };
atomic_t vm_committed_space[VM_ACCTDOM_NR] =
{ [ 0 ... VM_ACCTDOM_NR-1 ] = ATOMIC_INIT(0) };

@@ -612,8 +616,7 @@ munmap_back:
> current->rlim[RLIMIT_AS].rlim_cur)
return -ENOMEM;

- if (!(flags & MAP_NORESERVE) ||
- (acctdom == VM_AD_DEFAULT && sysctl_overcommit_memory > 1)) {
+ if (!(flags & MAP_NORESERVE) || sysctl_overcommit_memory[acctdom] > 1) {
if (vm_flags & VM_SHARED) {
/* Check memory availability in shmem_file_setup? */
vm_flags |= VM_ACCOUNT;
diff -X /home/apw/lib/vdiff.excl -rupN reference/mm/nommu.c current/mm/nommu.c
--- reference/mm/nommu.c 2004-03-25 15:03:32.000000000 +0000
+++ current/mm/nommu.c 2004-03-25 17:23:22.000000000 +0000
@@ -31,8 +31,12 @@ unsigned long num_physpages;
unsigned long askedalloc, realalloc;
atomic_t vm_committed_space[VM_ACCTDOM_NR] =
{ [ 0 ... VM_ACCTDOM_NR-1 ] = ATOMIC_INIT(0) };
-int sysctl_overcommit_memory; /* default is heuristic overcommit */
-int sysctl_overcommit_ratio = 50; /* default is 50% */
+/* Defaults are:
+ * VM_AD_DEFAULT heuristic overcommit, ratio 50%
+ * VM_AD_HUGETLB strict commit, ratio 100%
+ */
+int sysctl_overcommit_memory[VM_ACCTDOM_NR] = { 0, 0 };
+int sysctl_overcommit_ratio[VM_ACCTDOM_NR] = { 50, 100 };

/*
* Handle all mappings that got truncated by a "truncate()"
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/commoncap.c current/security/commoncap.c
--- reference/security/commoncap.c 2004-03-25 15:03:33.000000000 +0000
+++ current/security/commoncap.c 2004-03-25 17:15:17.000000000 +0000
@@ -315,9 +315,16 @@ int cap_vm_enough_memory(int domain, lon

vm_acct_memory(domain, pages);

+ /*
+ * Sometimes we want to use more memory than we have
+ */
+ if (sysctl_overcommit_memory[domain] == 1)
+ return 0;
+
/* Check against the full compliment of hugepages, no reserve. */
if (domain == VM_AD_HUGETLB) {
- allowed = hugetlb_total_pages();
+ allowed = hugetlb_total_pages() *
+ sysctl_overcommit_ratio[domain] / 100;

goto check;
}
@@ -328,13 +335,7 @@ int cap_vm_enough_memory(int domain, lon
if (domain != VM_AD_DEFAULT)
return 0;

- /*
- * Sometimes we want to use more memory than we have
- */
- if (sysctl_overcommit_memory == 1)
- return 0;
-
- if (sysctl_overcommit_memory == 0) {
+ if (sysctl_overcommit_memory[domain] == 0) {
unsigned long n;

free = get_page_cache_size();
@@ -372,7 +373,7 @@ int cap_vm_enough_memory(int domain, lon
return -ENOMEM;
}

- allowed = totalram_pages * sysctl_overcommit_ratio / 100;
+ allowed = totalram_pages * sysctl_overcommit_ratio[domain] / 100;
allowed += total_swap_pages;

check:
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/dummy.c current/security/dummy.c
--- reference/security/dummy.c 2004-03-25 15:03:33.000000000 +0000
+++ current/security/dummy.c 2004-03-25 17:16:21.000000000 +0000
@@ -116,9 +116,16 @@ static int dummy_vm_enough_memory(int do

vm_acct_memory(domain, pages);

+ /*
+ * Sometimes we want to use more memory than we have
+ */
+ if (sysctl_overcommit_memory[domain] == 1)
+ return 0;
+
/* Check against the full compliment of hugepages, no reserve. */
if (domain == VM_AD_HUGETLB) {
- allowed = hugetlb_total_pages();
+ allowed = hugetlb_total_pages() *
+ sysctl_overcommit_ratio[domain] / 100;

goto check;
}
@@ -129,13 +136,7 @@ static int dummy_vm_enough_memory(int do
if (domain != VM_AD_DEFAULT)
return 0;

- /*
- * Sometimes we want to use more memory than we have
- */
- if (sysctl_overcommit_memory == 1)
- return 0;
-
- if (sysctl_overcommit_memory == 0) {
+ if (sysctl_overcommit_memory[domain] == 0) {
free = get_page_cache_size();
free += nr_free_pages();
free += nr_swap_pages;
@@ -160,7 +161,7 @@ static int dummy_vm_enough_memory(int do
return -ENOMEM;
}

- allowed = totalram_pages * sysctl_overcommit_ratio / 100;
+ allowed = totalram_pages * sysctl_overcommit_ratio[domain] / 100;
allowed += total_swap_pages;

check:
diff -X /home/apw/lib/vdiff.excl -rupN reference/security/selinux/hooks.c current/security/selinux/hooks.c
--- reference/security/selinux/hooks.c 2004-03-25 15:03:33.000000000 +0000
+++ current/security/selinux/hooks.c 2004-03-25 17:16:44.000000000 +0000
@@ -1505,9 +1505,16 @@ static int selinux_vm_enough_memory(int

vm_acct_memory(domain, pages);

+ /*
+ * Sometimes we want to use more memory than we have
+ */
+ if (sysctl_overcommit_memory[domain] == 1)
+ return 0;
+
/* Check against the full compliment of hugepages, no reserve. */
if (domain == VM_AD_HUGETLB) {
- allowed = hugetlb_total_pages();
+ allowed = hugetlb_total_pages() *
+ sysctl_overcommit_ratio[domain] / 100;

goto check;
}
@@ -1518,13 +1525,7 @@ static int selinux_vm_enough_memory(int
if (domain != VM_AD_DEFAULT)
return 0;

- /*
- * Sometimes we want to use more memory than we have
- */
- if (sysctl_overcommit_memory == 1)
- return 0;
-
- if (sysctl_overcommit_memory == 0) {
+ if (sysctl_overcommit_memory[domain] == 0) {
free = get_page_cache_size();
free += nr_free_pages();
free += nr_swap_pages;
@@ -1558,7 +1559,7 @@ static int selinux_vm_enough_memory(int
return -ENOMEM;
}

- allowed = totalram_pages * sysctl_overcommit_ratio / 100;
+ allowed = totalram_pages * sysctl_overcommit_ratio[domain] / 100;
allowed += total_swap_pages;

check:

2004-03-25 21:03:02

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment

Andy Whitcroft <[email protected]> wrote:
>
> HUGETLB Overcommit Handling
> ---------------------------
> When building mappings the kernel tracks committed but not yet
> allocated pages against available memory and swap preventing memory
> allocation problems later. The introduction of hugetlb pages has
> significant ramifications for this accounting as the pages used
> to back them are already removed from the available memory pool.

Sorry, but I just don't see why we need all this complexity and generality.

If there was any likelihood that there would be additional memory domains
in the 2.6 future then OK. But I don't think there will be. We simply
need some little old patch which fixes this bug.

Such as adding a `vma' arg to vm_enough_memory() and vm_unacct_memory() and
doing

	if (is_vm_hugetlb_page(vma))
		return;

and

-	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
+	allowed = (totalram_pages - htlbpagemem << HPAGE_SHIFT) *
+		sysctl_overcommit_ratio / 100;

in cap_vm_enough_memory().
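
Note that, as written, C precedence shifts the whole difference ('-'
binds more tightly than '<<'), and totalram_pages counts small pages
while htlbpagemem counts huge pages; the intended arithmetic is
presumably closer to this sketch:

	allowed = (totalram_pages -
			(htlbpagemem << (HPAGE_SHIFT - PAGE_SHIFT))) *
		sysctl_overcommit_ratio / 100;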

2004-03-25 23:24:41

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment

--On 25 March 2004 13:04 -0800 Andrew Morton <[email protected]> wrote:

> Sorry, but I just don't see why we need all this complexity and generality.
>
> If there was any likelihood that there would be additional memory domains
> in the 2.6 future then OK. But I don't think there will be. We simply
> need some little old patch which fixes this bug.
>
> Such as adding a `vma' arg to vm_enough_memory() and vm_unacct_memory() and
> doing
>
> 	if (is_vm_hugetlb_page(vma))
> 		return;
>
> and
>
> -	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
> +	allowed = (totalram_pages - htlbpagemem << HPAGE_SHIFT) *
> +		sysctl_overcommit_ratio / 100;
>
> in cap_vm_enough_memory().

That's pretty much what you get if you only apply the first two patches. Sadly, you can't just pass a vma as you don't always have one when you are making the decision. For example, when a shm segment is being created you need to commit the memory at that point, but it's not been attached at all so there is no vma to check. That's why I went with an abstract domain. These patches have been tested in isolation and do seem to work.
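
A minimal sketch of what the abstract domain buys at segment-creation
time, assuming the security_vm_enough_memory() wrapper mirrors the
patched cap_vm_enough_memory(domain, pages) signature (the call site
below is illustrative, not the patch itself):

	/* at shmget() time, inside hugetlb_zero_setup(): no vma exists
	 * yet, but the hugetlb commitment can still be checked and
	 * charged up front, in small-page units */
	if (security_vm_enough_memory(VM_AD_HUGETLB, size >> PAGE_SHIFT))
		return ERR_PTR(-ENOMEM);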

The other patches started out wanting to solve a second issue; the generality seemed to come out naturally. I am not sure how important it is, but when we create a normal shm segment we commit the memory then. For a hugetlb one we only commit the memory when the region is attached the first time, i.e. when the pages are cleared and filled. Also we have no policy control over them.

In short, I guess if we are only trying to fix the overcommit cross-over between the normal and hugetlb pools, then the first two patches should be basically there.

Let me know what the decision is and I'll steer the ship in that direction.

-apw

2004-03-25 23:52:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment

Andy Whitcroft <[email protected]> wrote:
>
> --On 25 March 2004 13:04 -0800 Andrew Morton <[email protected]> wrote:
>
> > Sorry, but I just don't see why we need all this complexity and generality.
> >
> > If there was any likelihood that there would be additional memory domains
> > in the 2.6 future then OK. But I don't think there will be. We simply
> > need some little old patch which fixes this bug.
> >
> > Such as adding a `vma' arg to vm_enough_memory() and vm_unacct_memory() and
> > doing
> >
> > 	if (is_vm_hugetlb_page(vma))
> > 		return;
> >
> > and
> >
> > -	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
> > +	allowed = (totalram_pages - htlbpagemem << HPAGE_SHIFT) *
> > +		sysctl_overcommit_ratio / 100;
> >
> > in cap_vm_enough_memory().
>
> That's pretty much what you get if you only apply the first two patches. Sadly, you can't just pass a vma as you don't always have one when you are making the decision. For example, when a shm segment is being created you need to commit the memory at that point, but it's not been attached at all so there is no vma to check. That's why I went with an abstract domain. These patches have been tested in isolation and do seem to work.
>
> The other patches started out wanting to solve a second issue; the generality seemed to come out naturally. I am not sure how important it is, but when we create a normal shm segment we commit the memory then. For a hugetlb one we only commit the memory when the region is attached the first time, i.e. when the pages are cleared and filled. Also we have no policy control over them.
>
> In short, I guess if we are only trying to fix the overcommit cross-over between the normal and hugetlb pools, then the first two patches should be basically there.
>
> Let me know what the decision is and I'll steer the ship in that direction.

I think it's simply:

- Make normal overcommit logic skip hugepages completely

- Teach the overcommit_memory=2 logic that hugepages are basically
"pinned", so subtract them from the arithmetic.

And that's it. The hugepages are semantically quite different from normal
memory (prefaulted, preallocated, unswappable) and we've deliberately
avoided pretending otherwise.

As for the shm problem, well, perhaps it's best to leave vm_enough_memory()
as it is and fix it up in the callers. So most callsites will call:

static inline int vm_enough_memory_vma(struct vm_area_struct *vma,
				       unsigned long nr_pages)
{
	if (is_vm_hugetlb_page(vma))
		return 0;
	return vm_enough_memory(nr_pages);
}

and in do_mmap_pgoff() perhaps we can do:

+	if (file && !is_file_hugepages(file)) {
		charged = len >> PAGE_SHIFT;
		if (security_vm_enough_memory(charged))
			return -ENOMEM;
+	}



2004-03-26 00:08:49

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment

--On 25 March 2004 15:51 -0800 Andrew Morton <[email protected]> wrote:

> I think it's simply:
>
> - Make normal overcommit logic skip hugepages completely
>
> - Teach the overcommit_memory=2 logic that hugepages are basically
> "pinned", so subtract them from the arithmetic.
>
> And that's it. The hugepages are semantically quite different from normal
> memory (prefaulted, preallocated, unswappable) and we've deliberately
> avoided pretending otherwise.

True currently. Though the thread that prompted this was in response to
the time taken for this prefault, and to the wish to fault the pages on
demand.

I'll have a poke about at it and see how small I can make it.

-apw

2004-03-26 00:37:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment

Keith Owens <[email protected]> wrote:
>
> FWIW, lkcd (crash dump) treats hugetlb pages as normal kernel pages and
> dumps them, which is pointless and wastes a lot of time. To avoid
> dumping these pages in lkcd, I had to add a PG_hugetlb flag. lkcd runs
> at the page level, not mm or vma, so VM_hugetlb was not available. In
> set_hugetlb_mem_size()
>
> 	for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
> 		SetPageReserved(map);
> 		SetPageHugetlb(map);
> 		map++;
> 	}
>
> In dump_base.c, I changed kernel_page(), referenced_page() and
> unreferenced_page() to test for PageHugetlb() before PageReserved().

That makes sense.

> Since you are looking at identifying hugetlb pages, could any other
> code benefit from a PG_hugetlb flag?

In the overcommit code we don't actually have the page yet. We're asking
"do we have enough memory available to honour this mmap() invokation when
it later faults in real pages".

2004-03-26 00:43:39

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment

> I think it's simply:
>
> - Make normal overcommit logic skip hugepages completely
>
> - Teach the overcommit_memory=2 logic that hugepages are basically
> "pinned", so subtract them from the arithmetic.
>
> And that's it. The hugepages are semantically quite different from normal
> memory (prefaulted, preallocated, unswappable) and we've deliberately
> avoided pretending otherwise.

It would be nice (to fix some of the posted problems) if hugepages didn't
have to be prefaulted ... if they had their own overcommit pool (that we
used whether normal overcommit was on or not), that'd be unnecessary.

Specifically:

1) SGI found that requesting oodles of large pages took forever.
2) NUMA allocation API wants to be able to specify policies, which
means not prefaulting them.

I'd agree that stopping hugepages from using the main overcommit
pool is the first priority, but it'd be nice to go one stage further.

M.

2004-03-26 01:04:22

by Keith Owens

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment

On Thu, 25 Mar 2004 23:59:21 +0000,
Andy Whitcroft <[email protected]> wrote:
>--On 25 March 2004 15:51 -0800 Andrew Morton <[email protected]> wrote:
>
>> I think it's simply:
>>
>> - Make normal overcommit logic skip hugepages completely
>>
>> - Teach the overcommit_memory=2 logic that hugepages are basically
>> "pinned", so subtract them from the arithmetic.
>>
>> And that's it. The hugepages are semantically quite different from normal
>> memory (prefaulted, preallocated, unswappable) and we've deliberately
>> avoided pretending otherwise.
>
>True currently. Though the thread that prompted this was in response to
>the time taken for this prefault, and to the wish to fault the pages on
>demand.
>
>I'll have a poke about at it and see how small I can make it.

FWIW, lkcd (crash dump) treats hugetlb pages as normal kernel pages and
dumps them, which is pointless and wastes a lot of time. To avoid
dumping these pages in lkcd, I had to add a PG_hugetlb flag. lkcd runs
at the page level, not mm or vma, so VM_hugetlb was not available. In
set_hugetlb_mem_size()

	for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
		SetPageReserved(map);
		SetPageHugetlb(map);
		map++;
	}

In dump_base.c, I changed kernel_page(), referenced_page() and
unreferenced_page() to test for PageHugetlb() before PageReserved().

Since you are looking at identifying hugetlb pages, could any other
code benefit from a PG_hugetlb flag?

2004-03-26 02:03:00

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment


--On 25 March 2004 23:59 +0000 Andy Whitcroft <[email protected]> wrote:

> --On 25 March 2004 15:51 -0800 Andrew Morton <[email protected]> wrote:
>
>> I think it's simply:
>>
>> - Make normal overcommit logic skip hugepages completely
>>
>> - Teach the overcommit_memory=2 logic that hugepages are basically
>> "pinned", so subtract them from the arithmetic.
>>
>> And that's it. The hugepages are semantically quite different from
>> normal memory (prefaulted, preallocated, unswappable) and we've
>> deliberately avoided pretending otherwise.

Attached is a ground-up patch, trying just to cure the overcommit bug. The
main thrust is to ensure that VM_ACCOUNT actually only gets set on vmas
which are indeed accountable. With that ensured, much of the rest comes out
in the wash. It also removes the hugetlb memory from the arithmetic for the
overcommit_memory=2 case.

Attached are two patches, core and arch changes. They have been
compile-tested on i386 and appear to work. Is that more what you had in mind?

-apw


Attachments:
(No filename) (0.99 kB)
055-hugetlb_commit_arch.txt (2.80 kB)
050-hugetlb_commit.txt (5.53 kB)

2004-03-26 03:27:05

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] [0/6] HUGETLB memory commitment

On Thu, Mar 25, 2004 at 04:22:32PM -0800, Andrew Morton wrote:
> Keith Owens <[email protected]> wrote:
> >
> > FWIW, lkcd (crash dump) treats hugetlb pages as normal kernel pages and
> > dumps them, which is pointless and wastes a lot of time. To avoid
> > dumping these pages in lkcd, I had to add a PG_hugetlb flag. lkcd runs

This should already be fixed in recent versions of lkcd. It uses a
little bit of trickery to avoid an extra page flag -- hugetlb pages are
detected as "in use" as well as reserved, unlike other reserved pages,
which helps identify them.

/* to track all used (compound + zero order) pages */
#define PageInuse(p) (PageCompound(p) || page_count(p))

.
.

static inline int kernel_page(struct page *p)
{
	/* Need to exclude hugetlb pages. Clue: reserved but inuse */
	return (PageReserved(p) && !PageInuse(p)) || (!PageLRU(p) && PageInuse(p));
}

Regards
Suparna

> > at the page level, not mm or vma, so VM_hugetlb was not available. In
> > set_hugetlb_mem_size()
> >
> > 	for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
> > 		SetPageReserved(map);
> > 		SetPageHugetlb(map);
> > 		map++;
> > 	}
> >
> > In dump_base.c, I changed kernel_page(), referenced_page() and
> > unreferenced_page() to test for PageHugetlb() before PageReserved().
>
> That makes sense.
>
> > Since you are looking at identifying hugetlb pages, could any other
> > code benefit from a PG_hugetlb flag?
>
> In the overcommit code we don't actually have the page yet. We're asking
> "do we have enough memory available to honour this mmap() invocation when
> it later faults in real pages".

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2004-03-26 03:41:08

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] [0/6] HUGETLB memory commitment

On Fri, 26 Mar 2004 14:28:26 +0530,
Suparna Bhattacharya <[email protected]> wrote:
>On Thu, Mar 25, 2004 at 04:22:32PM -0800, Andrew Morton wrote:
>> Keith Owens <[email protected]> wrote:
>> >
>> > FWIW, lkcd (crash dump) treats hugetlb pages as normal kernel pages and
>> > dumps them, which is pointless and wastes a lot of time. To avoid
>> > dumping these pages in lkcd, I had to add a PG_hugetlb flag. lkcd runs
>
>This should already be fixed in recent versions of lkcd. It uses a
>little bit of trickery to avoid an extra page flag -- hugetlb pages are
>detected as "in use" as well as reserved, unlike other reserved pages,
>which helps identify them.

Are you sure that this works for hugetlb pages that have been
preallocated but not yet mapped? AFAICT the hugetlb pages start off as
reserved with a zero usecount.

2004-03-26 11:44:34

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] [0/6] HUGETLB memory commitment

On Fri, Mar 26, 2004 at 02:39:09PM +1100, Keith Owens wrote:
> On Fri, 26 Mar 2004 14:28:26 +0530,
> Suparna Bhattacharya <[email protected]> wrote:
> >On Thu, Mar 25, 2004 at 04:22:32PM -0800, Andrew Morton wrote:
> >> Keith Owens <[email protected]> wrote:
> >> >
> >> > FWIW, lkcd (crash dump) treats hugetlb pages as normal kernel pages and
> >> > dumps them, which is pointless and wastes a lot of time. To avoid
> >> > dumping these pages in lkcd, I had to add a PG_hugetlb flag. lkcd runs
> >
> >This should already be fixed in recent versions of lkcd. It uses a
> >little bit of trickery to avoid an extra page flag -- hugetlb pages are
> >detected as "in use" as well as reserved, unlike other reserved pages,
> >which helps identify them.
>
> Are you sure that this works for hugetlb pages that have been
> preallocated but not yet mapped? AFAICT the hugetlb pages start off as
> reserved with a zero usecount.
>

I just realised that hugetlb pages are no longer marked as reserved in
the current trees, and since they are allocated as compound pages
they would show up as being in use and not LRU. So, we do have a problem,
without PG_hugetlb.

Regards
Suparna


--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2004-03-28 18:00:12

by Ray Bryant

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment

I guess I am missing something entirely here. I've been off making "allocate on fault" hugetlb
pages work on 2.4.21 on Altix (that is, after all, the production kernel for Altix at
the present time). It's getting close; I am still working on making fork() work correctly with this, and
once that is done I will move it to 2.6 and submit a patch.

As I understood this originally, the suggestion was to reserve hugetlb pages at mmap() or shmget()
time so that the user would get an -ENOMEM at that time if there aren't enough hugetlb pages to
(eventually) satisfy the request, as per the notion that we shouldn't modify the user API due to
going with allocate on fault instead of hugetlb_prefault().

Since the reservation belongs to the mapped object (file or segment), I've been storing the current
file/segment's reservation in the file system dependent part of the inode. That way, it is easily
accessible when the hugetlbfs file or SysV segment is removed and we can reduce the total number of
reserved pages by that file's reservation at that time. This also allows us to handle the
reservation in the absence of a vma, as per Andy's comment below.

Admittedly this doesn't allow one to request that hugetlbpages be overcommitted, or to handle
problems caused to the "normal" page overcommit code due to the presence of hugepages. But we
figure that anyone that is actually using hugetlb pages is likely to take over almost all of main
memory anyway in a single job, so overcommit doesn't make much sense to us.

So, am I completely off "in the weeds" on this, or does the above seem like an acceptable, and simple,
approach?

Andy Whitcroft wrote:
> --On 25 March 2004 13:04 -0800 Andrew Morton <[email protected]> wrote:
>
>
>>Sorry, but I just don't see why we need all this complexity and generality.
>>
>>If there was any likelihood that there would be additional memory domains
>>in the 2.6 future then OK. But I don't think there will be. We simply
>>need some little old patch which fixes this bug.
>>
>>Such as adding a `vma' arg to vm_enough_memory() and vm_unacct_memory() and
>>doing
>>
>>	if (is_vm_hugetlb_page(vma))
>>		return;
>>
>>and
>>
>>-	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
>>+	allowed = (totalram_pages - htlbpagemem << HPAGE_SHIFT) *
>>+		sysctl_overcommit_ratio / 100;
>>
>>in cap_vm_enough_memory().
>
>
> That's pretty much what you get if you only apply the first two patches. Sadly, you can't just pass a vma as you don't always have one when you are making the decision. For example, when a shm segment is being created you need to commit the memory at that point, but it's not been attached at all so there is no vma to check. That's why I went with an abstract domain. These patches have been tested in isolation and do seem to work.
>
> The other patches started out wanting to solve a second issue; the generality seemed to come out naturally. I am not sure how important it is, but when we create a normal shm segment we commit the memory then. For a hugetlb one we only commit the memory when the region is attached the first time, i.e. when the pages are cleared and filled. Also we have no policy control over them.
>
> In short, I guess if we are only trying to fix the overcommit cross-over between the normal and hugetlb pools, then the first two patches should be basically there.
>
> Let me know what the decision is and I'll steer the ship in that direction.
>
> -apw
>

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-03-28 19:11:36

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment

> As I understood this originally, the suggestion was to reserve hugetlb
> pages at mmap() or shmget() time so that the user would get an -ENOMEM
> at that time if there aren't enough hugetlb pages to (eventually) satisfy
> the request, as per the notion that we shouldn't modify the user API due
> to going with allocate on fault instead of hugetlb_prefault().

Yup, but there were two parts to it:

1. Stop hugepages using the existing overcommit pool for small pages,
which breaks small page allocations by prematurely exhausting the pool.
2. Give hugepages their own over-commit pool, instead of prefaulting.

Personally I think we need both (as you seem to), but (1) is probably
more urgent.

> Since the reservation belongs to the mapped object (file or segment),
> I've been storing the current file/segment's reservation in the file
> system dependent part of the inode. That way, it is easily accessible
> when the hugetlbfs file or SysV segment is removed and we can reduce
> the total number of reserved pages by that file's reservation at that
> time. This also allows us to handle the reservation in the absence
> of a vma, as per Andy's comment below.

Do we need to store it there, or is one central pool number sufficient?
I would have thought it was ...

> Admittedly this doesn't allow one to request that hugetlbpages be
> overcommitted, or to handle problems caused to the "normal" page
> overcommit code due to the presence of hugepages. But we figure that
> anyone that is actually using hugetlb pages is likely to take over
> almost all of main memory anyway in a single job, so overcommit
> doesn't make much sense to us.

Seeing as you can't swap them, overcommitting makes no sense to me
either ;-)

M.

2004-03-28 21:27:15

by Ray Bryant

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] [0/6] HUGETLB memory commitment



Martin J. Bligh wrote:
>>As I understood this originally, the suggestion was to reserve hugetlb
>>pages at mmap() or shmget() time so that the user would get an -ENOMEM
>>at that time if there aren't enough hugetlb pages to (eventually) satisfy
>>the request, as per the notion that we shouldn't modify the user API due
>>to going with allocate on fault instead of hugetlb_prefault().
>
>
> Yup, but there were two parts to it:
>
> 1. Stop hugepages using the existing overcommit pool for small pages,
> which breaks small page allocations by prematurely exhausting the pool.
> 2. Give hugepages their own over-commit pool, instead of prefaulting.
>
> Personally I think we need both (as you seem to), but (1) is probably
> more urgent.

Just to review: even if we allocate hugetlb pages at fault rather than at mmap() time, hugetlb
pages are created either at system boot time (kernel parameter "hugepages=") or by setting
/proc/sys/vm/nr_hugepages (or by using the corresponding sysctl). Once the set of hugepages is
created this way, it is never changed by the act of allocating a huge page to a process. (Changing
nr_hugepages can cause the number of unallocated hugetlbpages to be increased or decreased.)

The reason for pointing this out (apologies if this was obvious to all) is to emphasize that
hugetlbpages are not created at hugetlbpage allocation time (which is now done at mmap() time and
we'd like to change it to happen at fault time).

So to stop hugepages from using the small page overcommit pool, we just need code in
set_hugetlb_mem_size() to reduce the number of hugetlbpages created by that code.

As for (2), I'm a little confused there, as later you appear to agree with me that overcommitting
hugetlbpages is likely not useful. Is it possible that you meant that there should be a list of
hugetlbpages from which all allocations are made? If so, that is the way the code has always
worked; step 1 was to create the list of hugetlbpages, and step 2 was to allocate them.

(Once again, if this is obvious to all, I apologize and we can dump the last 4 paragraphs into the
bit bucket with no known effect on entropy in this universe, at least.)

>
>
>>Since the reservation belongs to the mapped object (file or segment),
>>I've been storing the current file/segment's reservation in the file
>>system dependent part of the inode. That way, it is easily accessible
>>when the hugetlbfs file or SysV segment is removed and we can reduce
>>the total number of reserved pages by that file's reservation at that
>>time. This also allows us to handle the reservation in the absence
>>of a vma, as per Andy's comment below.
>
>
> Do we need to store it there, or is one central pool number sufficient?
> I would have thought it was ...
>

Yes, there is a central pool number indicating how many hugepages are reserved. The question is,
when (and how) do you release that reservation? My take is that the reservation is associated with
the file (for mmap) or segment for SysV.

For example, program A mmap()'s a hugetlbfs file, but only touches part of the pages. Program B
then mmap()'s the same file with the same size, etc. When program B does the mmap() the previous
reservation should still be in place, right? (The file is persistent in the page cache even if it
does not persist over reboot, so the 2nd program is expecting to see the data that the first
program put there.)

Ditto for a SysV segment.

So one can't release the reservation when the current process doing the mmap() goes away; one has to
release the reservation when the file/segment is deleted. Since both mmap() and shmget() create an
inode, and the inode is released by hugetlbfs_drop_inode() and friends, it seemed simplest to put
the size of the mapped object's reservation in the inode.

The global count of reserved pages (the "central pool number" in your note) is incremented at
mmap() time (well, actually done by hugetlbfs_file_mmap() for both mmap() and shmget()) and
decremented at hugetlbfs_drop_inode() time. If at mmap() time, incrementing the global reservation
count would make the global reserved pages count > the number of hugetlbpages, we fail the mmap()
with -ENOMEM.

At least that is the way my 2.4.21 code works. Does that make things clearer?
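
A minimal sketch of that lifecycle; htlb_total, htlb_reserved, the lock
and the per-inode reserved field are hypothetical names, not the actual
2.4.21 code:

	static spinlock_t htlb_reserve_lock = SPIN_LOCK_UNLOCKED;
	static long htlb_total;		/* configured hugetlb pages */
	static long htlb_reserved;	/* the "central pool number" */

	/* hugetlbfs_file_mmap() path, reached for both mmap() and shmget() */
	int hugetlb_reserve(struct inode *inode, long nr_hugepages)
	{
		int ret = -ENOMEM;

		spin_lock(&htlb_reserve_lock);
		if (htlb_reserved + nr_hugepages <= htlb_total) {
			htlb_reserved += nr_hugepages;
			/* the file's own share, kept in the fs-private inode */
			HUGETLBFS_I(inode)->reserved += nr_hugepages;
			ret = 0;
		}
		spin_unlock(&htlb_reserve_lock);
		return ret;	/* -ENOMEM fails the mmap()/shmget() */
	}

	/* hugetlbfs_drop_inode() path: the file/segment is finally deleted */
	void hugetlb_unreserve(struct inode *inode)
	{
		spin_lock(&htlb_reserve_lock);
		htlb_reserved -= HUGETLBFS_I(inode)->reserved;
		HUGETLBFS_I(inode)->reserved = 0;
		spin_unlock(&htlb_reserve_lock);
	}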

>
>>Admittedly this doesn't allow one to request that hugetlbpages be
>>overcommitted, or to handle problems caused to the "normal" page
>>overcommit code due to the presence of hugepages. But we figure that
>>anyone that is actually using hugetlb pages is likely to take over
>>almost all of main memory anyway in a single job, so overcommit
>>doesn't make much sense to us.
>
>
> Seeing as you can't swap them, overcommitting makes no sense to me
> either ;-)
>
> M.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-03-29 13:10:44

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] [0/6] HUGETLB memory commitment

--On 28 March 2004 11:10 -0800 "Martin J. Bligh" <[email protected]> wrote:

> 1. Stop hugepages using the existing overcommit pool for small pages,
> which breaks small page allocations by prematurely exhausting the pool.
> 2. Give hugepages their own over-commit pool, instead of prefaulting.

Indeed. The previous patches I submitted only address #1. Attached is
another patch which should address #2; it supplies hugetlb commit
accounting. This is checked and applied when the segment is created. It
also supplements the meminfo information to display this new commitment.
The patch only implements strict commitment, but as has been stated here
often, it is not clear that overcommit of unswappable memory makes any
sense in the absence of demand allocation. When that is implemented,
this will likely need a policy.

Patch applies on top of my previous patch and has been tested on i386.

-apw


Attachments:
(No filename) (898.00 B)
070-hugetlb_commit.txt (3.88 kB)

2004-03-29 17:04:53

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] [0/6] HUGETLB memory commitment

>> Yup, but there were two parts to it:
>>
>> 1. Stop hugepages using the existing overcommit pool for small pages,
>> which breaks small page allocations by prematurely exhausting the pool.
>> 2. Give hugepages their own over-commit pool, instead of prefaulting.
>>
>> Personally I think we need both (as you seem to), but (1) is probably
>> more urgent.
>
> Just to review: even if we allocate hugetlb pages at fault rather than
> at mmap() time, hugetlb pages are created either at system boot time
> (kernel parameter "hugepages=") or by setting /proc/sys/vm/nr_hugepages
> (or by using the corresponding sysctl). Once the set of hugepages is
> created this way, it is never changed by the act of allocating a huge
> page to a process. (Changing nr_hugepages can cause the number of unallocated
> hugetlbpages to be increased or decreased.)

Yup.

> The reason for pointing this out (apologies if this was obvious to all)
> is to emphasize that hugetlbpages are not created at hugetlbpage allocation
> time (which is now done at mmap() time and we'd like to change it to happen
> at fault time).

Yup.

> So to stop hugepages from using the small page overcommit pool, we just
> need code in set_hugetlb_mem_size() to reduce the number of hugetlbpages
> created by that code.

I think Andy already fixed that bit, though I'm not sure what method he used.
It seems to me (without really looking) that we just need to not decrement
the pool size when we map a huge page.

> As for (2), I'm a little confused there, as later you appear to agree
> with me that overcommitting hugetlbpages is likely not useful.

I think I'm just being confusing via sloppy terminology, but we're in
resounding agreement in reality ;-)

> Is it possible that you meant that there should be a list of hugetlbpages
> from which all allocations are made? If so, that is the way the code has
> always worked; step 1 was to create the list of hugetlbpages, and step 2
> was to allocate them.

I meant if we keep a counter of the number of hugetlb pages available, every
time we get a call to allocate them, we can avoid prefault by just decrementing
the counter of "available" pages, and fault them in later, just like the existing
strict-overcommit code does, and we'll never fail to allocate. If we're doing
*strict* NUMA bindings, it does need to be a little more complex, in that
things will need to remember which node they're "pre-allocated" from. The fact
that the "overcommit" code *prevents* overcommit is probably not helping the
discussion's clarity ;-)

> (Once again, if this is obvious to all, I apologize and we can dump the last
> 4 paragraphs into the bit bucket with no known effect on entropy in this
> universe, at least.)

Well, above is what *I* meant, and I *think* roughly what you meant. But probably
best to clarify ;-)

>>> Since the reservation belongs to the mapped object (file or segment),
>>> I've been storing the current file/segment's reservation in the file
>>> system dependent part of the inode. That way, it is easily accessible
>>> when the hugetlbfs file or SysV segment is removed and we can reduce
>>> the total number of reserved pages by that file's reservation at that
>>> time. This also allows us to handle the reservation in the absence
>>> of a vma, as per Andy's comment below.
>>
>>
>> Do we need to store it there, or is one central pool number sufficient?
>> I would have thought it was ...
>
> Yes, there is a central pool number indicating how many hugepages are reserved.
> The question is, when (and how) do you release that reservation? My take is
> that the reservation is associated with the file (for mmap) or segment for SysV.

Ah, I see what you mean. You can't really release it at 0 refcount without changing
the semantics, in case it's re-used later. Hum. Yes, I see what you mean.

> For example, program A mmap()'s a hugetlbfs file, but only touches part of the
> pages. Program B then mmap()'s the same file with the same size, etc. When
> program B does the mmap() the previous reservation should still be in place, right?
> (The file is persistent in the page cache even if it does not persist over reboot,
> so the 2nd program is expecting to see the data that the first program put there.)
>
> Ditto for a SysV segment.

Yes. I think Adam's patches in my tree support anon mem_map though. That's going to
get rather tricky ... we run into similar problems as objrmap, I think.

> So one can't release the reservation when the current process doing the mmap()
> goes away; one has to release the reservation when the file/segment is deleted.
> Since both mmap() and shmget() create an inode, and the inode is released by
> hugetlbfs_drop_inode() and friends, it seemed simplest to put the size of the
> mapped object's reservation in the inode.

Yup, I'd missed that - thanks for explaining ;-)

> The global count of reserved pages (the "central pool number" in your note)
> is incremented at mmap() time (well, actually done by hugetlbfs_file_mmap()
> for both mmap() and shmget()) and decremented at hugetlbfs_drop_inode() time.
> If at mmap() time, incrementing the global reservation count would make the
> global reserved pages count > the number of hugetlbpages, we fail the mmap()
> with -ENOMEM.
>
> At least that is the way my 2.4.21 code works. Does that make things clearer?

A lot ;-)

Thanks,

M.

2004-03-29 20:47:12

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

>>>> Andy Whitcroft wrote on Mon, March 29, 2004 4:30 AM
> Indeed. The previous patches I submitted only address #1. Attached is
> another patch which should address #2, it supplies hugetlb commit
> accounting. This is checked and applied when the segment is created. It
> also supplements the meminfo information to display this new commitment.
> The patch only implments strict commitment, but as has been stated here
> often, it is not clear that overcommit of unswappable memory makes any
> sense in the absence of demand allocation. When that is implemented then
> this will likely need a policy.
>
> Patch applies on top of my previous patch and has been tested on i386.


+int hugetlbfs_report_meminfo(char *buf)
+{
+ long htlb = atomic_read(&hugetlb_committed_space);
+ return sprintf(buf, "HugeCommited_AS: %5lu\n", htlb);
+}

"HugeCommited_AS", typo?? Should that be double "t"? Also can we print
in terms of kB instead of num pages to match all other entries? Something
like: htlb<<(PAGE_SHIFT-10)?

overcommit is not checked for hugetlb mmap, is it intentional here?

- Ken


2004-03-29 20:52:33

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

>>>>>> Chen, Kenneth W wrote on Mon, March 29, 2004 12:46 PM
> overcommit is not checked for hugetlb mmap, is it intentional here?

Just to follow up myself, I meant overcommit accounting is not done
for mmap hugetlb page. (typical Monday morning symptom :))

- Ken


2004-03-30 12:55:34

by Andy Whitcroft

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

--On 29 March 2004 12:45 -0800 "Chen, Kenneth W" <[email protected]> wrote:

> +int hugetlbfs_report_meminfo(char *buf)
> +{
> + long htlb = atomic_read(&hugetlb_committed_space);
> + return sprintf(buf, "HugeCommited_AS: %5lu\n", htlb);
> +}
>
> "HugeCommited_AS", typo?? Should that be double "t"? Also can we print
> in terms of kB instead of num pages to match all other entries? Something
> like: htlb<<(PAGE_SHIFT-10)?

Doh and Doh. Yes, we went through a stage where this was in hugetlb
pages, but it has ended up in the same units as the small page
pool. Attached is a replacement patch with this changed; below
is a relative diff against the previous patch.

> overcommit is not checked for hugetlb mmap, is it intentional here?

> Just to follow up myself, I meant overcommit accounting is not done
> for mmap hugetlb page. (typical Monday morning symptom :))

Essentially, hugetlb pages can only be part of a shared mapping in
the current implementation. As a result all commitments are made
and checked at segment create time. The commitment cannot change.

Hope that's what you meant.

Martin, perhaps this is a candidate for your -mjb tree?

-apw

diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c 2004-03-29 14:05:22.000000000 +0100
+++ current/fs/hugetlbfs/inode.c 2004-03-30 09:52:59.000000000 +0100
@@ -47,8 +47,10 @@ int hugetlb_acct_memory(long delta)

int hugetlbfs_report_meminfo(char *buf)
{
+#define K(x) ((x) << (PAGE_SHIFT - 10))
long htlb = atomic_read(&hugetlb_committed_space);
- return sprintf(buf, "HugeCommited_AS: %5lu\n", htlb);
+ return sprintf(buf, "HugeCommitted_AS: %5lu kB\n", K(htlb));
+#undef K
}

static struct super_operations hugetlbfs_ops;


Attachments:
(No filename) (1.75 kB)
070-hugetlb_commit.txt (3.82 kB)

2004-03-30 20:05:35

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

>>>>> Andy Whitcroft wrote on Tuesday, March 30, 2004 4:58 AM
> >
> > Just to follow up myself, I meant overcommit accounting is not done
> > for mmap hugetlb page. (typical Monday morning symptom :))
>
> Essentially, hugetlb pages can only be part of a shared mapping in
> the current implementation. As a result all commitments are made
> and checked at segment create time. The commitment cannot change.
>
> Hope that's what you meant.

Not quite, I can simply mmap a hugetlbfs-backed file to get hugetlb
pages. File expansion is transparent. It gets even trickier with a file
that has holes in it.

I can do:
fd = open("/mnt/htlb/myhtlbfile", O_CREAT|O_RDWR, 0755);
mmap(..., fd, offset);

Accounting didn't happen in this case, (grep Huge /proc/meminfo):

HugePages_Total: 10
HugePages_Free: 9
Hugepagesize: 262144 kB
HugeCommitted_AS: 0 kB

Now if I remove the file "myhtlbfile", accounting is done for inode
removal and hugetlb_committed_space underflows.

HugePages_Total: 10
HugePages_Free: 10
Hugepagesize: 262144 kB
HugeCommitted_AS: 18446744073709289472 kB


2004-03-30 21:51:18

by Andy Whitcroft

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

--On 30 March 2004 12:04 -0800 "Chen, Kenneth W" <[email protected]>
wrote:

> I can do:
> fd = open("/mnt/htlb/myhtlbfile", O_CREAT|O_RDWR, 0755);
> mmap(..., fd, offset);
>
> Accounting didn't happen in this case, (grep Huge /proc/meminfo):
>
> HugePages_Total: 10
> HugePages_Free: 9
> Hugepagesize: 262144 kB
> HugeCommitted_AS: 0 kB

Oooops. Now I get you. Thanks for pointing that out. More work required.

-apw

2004-03-31 01:49:51

by Andy Whitcroft

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

--On 30 March 2004 22:48 +0100 Andy Whitcroft <[email protected]> wrote:

>> I can do:
>> fd = open("/mnt/htlb/myhtlbfile", O_CREAT|O_RDWR, 0755);
>> mmap(..., fd, offset);
>>
>> Accounting didn't happen in this case, (grep Huge /proc/meminfo):

O.k. Try this one. Should fix that case. There is some ugliness in there
which needs review, but my testing says this works.

Thanks for testing.

-apw


Attachments:
(No filename) (402.00 B)
070-hugetlb_commit.txt (4.40 kB)

2004-03-31 08:52:44

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

>>>> Andy Whitcroft wrote on Tuesday, March 30, 2004 5:49 PM
>>> fd = open("/mnt/htlb/myhtlbfile", O_CREAT|O_RDWR, 0755);
>>> mmap(..., fd, offset);
>>>
>>> Accounting didn't happen in this case, (grep Huge /proc/meminfo):
>
> O.k. Try this one. Should fix that case. There is some ugliness in
> there which needs review, but my testing says this works.

Under common case, worked perfectly! But there are always corner cases.

I can think of two ugly cases:
1. very sparse hugetlb file. I can mmap one hugetlb page, at offset
512 GB. This would account 512GB + 1 hugetlb page as committed_AS.
But I only asked for one page mapping. One can say it's a feature,
but I think it's a bug.

2. There is no error checking (to undo the committed_AS accounting) after
hugetlb_prefault(). hugetlb_prefault doesn't always succeed in allocating
all the pages the user asked for due to the disk quota limit. It can have
a partial allocation which would put the committed_AS in a wedged state.

- Ken


2004-03-31 16:19:06

by Andy Whitcroft

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

--On 31 March 2004 00:51 -0800 "Chen, Kenneth W" <[email protected]>
wrote:

>>>>> Andy Whitcroft wrote on Tuesday, March 30, 2004 5:49 PM
>>>> fd = open("/mnt/htlb/myhtlbfile", O_CREAT|O_RDWR, 0755);
>>>> mmap(..., fd, offset);
>>>>
>>>> Accounting didn't happen in this case, (grep Huge /proc/meminfo):
>>
>> O.k. Try this one. Should fix that case. There is some uglyness in
>> there which needs review, but my testing says this works.
>
> Under common case, worked perfectly! But there are always corner cases.
>
> I can think of two ugly cases:
> 1. very sparse hugetlb file. I can mmap one hugetlb page, at offset
> 512 GB. This would account 512GB + 1 hugetlb page as committed_AS.
> But I only asked for one page mapping. One can say it's a feature,
> but I think it's a bug.

Yes. This is true. This is consistent with the preallocation behaviour of
shared memory segments, but inconsistent with the behaviour of mmap'ing
/dev/zero which it essentially emulates. This is not trivial to fix as we
do not get informed when the unmap occurs. Accounting for normal pages is
handled directly by the VM unmap code. I think I have found a way to track
these but it does blur the interfaces between the hugetlbfs and hugepage
implementations.

There are a number of other 'bugs' in the implementation of hugetlb. For
example, the MAP_SHARED/MAP_PRIVATE flags are ignored; behaviour is
identical in both cases.

> 2. There is no error checking (to undo the committed_AS accounting) after
> hugetlb_prefault(). hugetlb_prefault doesn't always succeed in allocating
> all the pages the user asked for due to the disk quota limit. It can have
> a partial allocation which would put the committed_AS in a wedged state.

True, this needs work on the interface to the quota system in hugetlbfs.
We essentially need to check the quota before we attempt to fault any
pages. I'll change it around and see how it looks.

Expect new patches tomorrow ...

-apw


2004-04-01 21:32:46

by Andy Whitcroft

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

--On 31 March 2004 00:51 -0800 "Chen, Kenneth W" <[email protected]> wrote:

> Under common case, worked perfectly! But there are always corner cases.
>
> I can think of two ugly cases:
> 1. very sparse hugetlb file. I can mmap one hugetlb page, at offset
> 512 GB. This would account 512GB + 1 hugetlb page as committed_AS.
> But I only asked for one page mapping. One can say it's a feature,
> but I think it's a bug.
>
> 2. There is no error checking (to undo the committed_AS accounting) after
> hugetlb_prefault(). hugetlb_prefault doesn't always succeed in allocating
> all the pages the user asked for due to the disk quota limit. It can have
> a partial allocation which would put the committed_AS in a wedged state.

O.k. Here is the latest version of the hugetlb commitment tracking patch
(hugetlb_tracking_R4). This now understands the difference between shm
allocated and mmap allocated and handles them differently. This should
fix 1. We now handle the commitments correctly under quota failures.

Please review.

-apw

---
arch/i386/mm/hugetlbpage.c | 30 +++++++++++++------
file | 1
fs/hugetlbfs/inode.c | 69 +++++++++++++++++++++++++++++++++++++++++++--
fs/proc/proc_misc.c | 1
include/linux/hugetlb.h | 5 +++
5 files changed, 93 insertions(+), 13 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/i386/mm/hugetlbpage.c current/arch/i386/mm/hugetlbpage.c
--- reference/arch/i386/mm/hugetlbpage.c 2004-04-01 13:37:14.000000000 +0100
+++ current/arch/i386/mm/hugetlbpage.c 2004-04-01 21:54:54.000000000 +0100
@@ -72,6 +72,7 @@ static struct page *alloc_hugetlb_page(v
spin_unlock(&htlbpage_lock);
return NULL;
}
+printk(KERN_WARNING "alloc_hugetlb_page: alloced %08lx\n", (unsigned long) page);
htlbpagemem--;
spin_unlock(&htlbpage_lock);
set_page_count(page, 1);
@@ -282,6 +283,7 @@ static void free_huge_page(struct page *

INIT_LIST_HEAD(&page->list);

+printk(KERN_WARNING "free_huge_page: returned %08lx\n", (unsigned long) page);
spin_lock(&htlbpage_lock);
enqueue_huge_page(page);
htlbpagemem++;
@@ -334,6 +336,7 @@ int hugetlb_prefault(struct address_spac
struct mm_struct *mm = current->mm;
unsigned long addr;
int ret = 0;
+ struct page *page;

BUG_ON(vma->vm_start & ~HPAGE_MASK);
BUG_ON(vma->vm_end & ~HPAGE_MASK);
@@ -342,7 +345,6 @@ int hugetlb_prefault(struct address_spac
for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
unsigned long idx;
pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;

if (!pte) {
ret = -ENOMEM;
@@ -355,30 +357,38 @@ int hugetlb_prefault(struct address_spac
+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
page = find_get_page(mapping, idx);
if (!page) {
- /* charge the fs quota first */
+ /* charge against commitment */
+ ret = hugetlb_charge_page(vma);
+ if (ret)
+ goto out;
+ /* charge the fs quota */
if (hugetlb_get_quota(mapping)) {
ret = -ENOMEM;
- goto out;
+ goto undo_charge;
}
page = alloc_hugetlb_page();
if (!page) {
- hugetlb_put_quota(mapping);
ret = -ENOMEM;
- goto out;
+ goto undo_quota;
}
ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
unlock_page(page);
- if (ret) {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
+ if (ret)
+ goto undo_page;
}
set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
}
out:
spin_unlock(&mm->page_table_lock);
return ret;
+
+undo_page:
+ free_huge_page(page);
+undo_quota:
+ hugetlb_put_quota(mapping);
+undo_charge:
+ hugetlb_uncharge_page(vma);
+ goto out;
}

static void update_and_free_page(struct page *page)
diff -X /home/apw/lib/vdiff.excl -rupN reference/file current/file
--- reference/file 1970-01-01 01:00:00.000000000 +0100
+++ current/file 2004-04-01 13:37:14.000000000 +0100
@@ -0,0 +1 @@
+this is more text
diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c 2004-03-25 02:43:00.000000000 +0000
+++ current/fs/hugetlbfs/inode.c 2004-04-01 22:41:07.000000000 +0100
@@ -32,6 +32,53 @@
/* some random number */
#define HUGETLBFS_MAGIC 0x958458f6

+#define HUGETLBFS_NOACCT (~0UL)
+
+atomic_t hugetlb_committed_space = ATOMIC_INIT(0);
+
+int hugetlb_acct_memory(long delta)
+{
+printk(KERN_WARNING "hugetlb_acct_memory: delta<%ld>\n", delta);
+ atomic_add(delta, &hugetlb_committed_space);
+ if (delta > 0 && atomic_read(&hugetlb_committed_space) >
+ hugetlb_total_pages()) {
+ atomic_add(-delta, &hugetlb_committed_space);
+ return -ENOMEM;
+ }
+ return 0;
+}
+int hugetlb_charge_page(struct vm_area_struct *vma)
+{
+ int ret;
+
+ /* if this file is marked for commit on demand then see if we can
+ * commit a page, if so account for it against this file. */
+ if (vma->vm_file->f_dentry->d_inode->i_blocks != ~0) {
+ ret = hugetlb_acct_memory(HPAGE_SIZE / PAGE_SIZE);
+ if (ret)
+ return ret;
+ vma->vm_file->f_dentry->d_inode->i_blocks++;
+ }
+ return 0;
+}
+int hugetlb_uncharge_page(struct vm_area_struct *vma)
+{
+ /* if this file is marked for commit on demand return a page. */
+ if (vma->vm_file->f_dentry->d_inode->i_blocks != ~0) {
+ hugetlb_acct_memory(-(HPAGE_SIZE / PAGE_SIZE));
+ vma->vm_file->f_dentry->d_inode->i_blocks--;
+ }
+ return 0;
+}
+
+int hugetlbfs_report_meminfo(char *buf)
+{
+#define K(x) ((x) << (PAGE_SHIFT - 10))
+ long htlb = atomic_read(&hugetlb_committed_space);
+ return sprintf(buf, "HugeCommitted_AS: %5lu kB\n", K(htlb));
+#undef K
+}
+
static struct super_operations hugetlbfs_ops;
static struct address_space_operations hugetlbfs_aops;
struct file_operations hugetlbfs_file_operations;
@@ -62,11 +109,11 @@ static int hugetlbfs_file_mmap(struct fi
vma_len = (loff_t)(vma->vm_end - vma->vm_start);

down(&inode->i_sem);
+ len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
file_accessed(file);
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
ret = hugetlb_prefault(mapping, vma);
- len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
if (ret == 0 && inode->i_size < len)
inode->i_size = len;
up(&inode->i_sem);
@@ -200,6 +247,11 @@ static void hugetlbfs_delete_inode(struc

if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);
+ if (inode->i_blocks != HUGETLBFS_NOACCT)
+ hugetlb_acct_memory(-(inode->i_blocks *
+ (HPAGE_SIZE / PAGE_SIZE)));
+ else
+ hugetlb_acct_memory(-(inode->i_size / PAGE_SIZE));

security_inode_delete(inode);

@@ -241,6 +293,11 @@ out_truncate:
spin_unlock(&inode_lock);
if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);
+ if (inode->i_blocks != HUGETLBFS_NOACCT)
+ hugetlb_acct_memory(-(inode->i_blocks *
+ (HPAGE_SIZE / PAGE_SIZE)));
+ else
+ hugetlb_acct_memory(-(inode->i_size / PAGE_SIZE));

if (sbinfo->free_inodes >= 0) {
spin_lock(&sbinfo->stat_lock);
@@ -350,6 +407,10 @@ static int hugetlbfs_setattr(struct dent
error = hugetlb_vmtruncate(inode, attr->ia_size);
if (error)
goto out;
+ /* We rely on the fact that the sizes are hugepage aligned,
+ * and that hugetlb_vmtruncate prevents extend. */
+ hugetlb_acct_memory((attr->ia_size - i_size_read(inode)) /
+ PAGE_SIZE);
attr->ia_valid &= ~ATTR_SIZE;
}
error = inode_setattr(inode, attr);
@@ -710,8 +771,9 @@ struct file *hugetlb_zero_setup(size_t s
if (!capable(CAP_IPC_LOCK))
return ERR_PTR(-EPERM);

- if (!is_hugepage_mem_enough(size))
- return ERR_PTR(-ENOMEM);
+ error = hugetlb_acct_memory(size / PAGE_SIZE);
+ if (error)
+ return ERR_PTR(error);

root = hugetlbfs_vfsmount->mnt_root;
snprintf(buf, 16, "%lu", hugetlbfs_counter());
@@ -736,6 +798,7 @@ struct file *hugetlb_zero_setup(size_t s
d_instantiate(dentry, inode);
inode->i_size = size;
inode->i_nlink = 0;
+ inode->i_blocks = HUGETLBFS_NOACCT;
file->f_vfsmnt = mntget(hugetlbfs_vfsmount);
file->f_dentry = dentry;
file->f_mapping = inode->i_mapping;
diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/proc/proc_misc.c current/fs/proc/proc_misc.c
--- reference/fs/proc/proc_misc.c 2004-03-29 12:10:18.000000000 +0100
+++ current/fs/proc/proc_misc.c 2004-04-01 13:37:14.000000000 +0100
@@ -232,6 +232,7 @@ static int meminfo_read_proc(char *page,
);

len += hugetlb_report_meminfo(page + len);
+ len += hugetlbfs_report_meminfo(page + len);

return proc_calc_metrics(page, start, off, count, eof, len);
#undef K
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h
--- reference/include/linux/hugetlb.h 2004-03-29 12:10:22.000000000 +0100
+++ current/include/linux/hugetlb.h 2004-04-01 21:56:56.000000000 +0100
@@ -115,11 +115,16 @@ static inline void set_file_hugepages(st
{
file->f_op = &hugetlbfs_file_operations;
}
+int hugetlbfs_report_meminfo(char *);
+int hugetlb_charge_page(struct vm_area_struct *vma);
+int hugetlb_uncharge_page(struct vm_area_struct *vma);
+
#else /* !CONFIG_HUGETLBFS */

#define is_file_hugepages(file) 0
#define set_file_hugepages(file) BUG()
#define hugetlb_zero_setup(size) ERR_PTR(-ENOSYS)
+#define hugetlbfs_report_meminfo(buf) 0

#endif /* !CONFIG_HUGETLBFS */




2004-04-01 22:59:18

by Andy Whitcroft

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

--On 01 April 2004 22:15 +0100 Andy Whitcroft <[email protected]> wrote:

> O.k. Here is the latest version of the hugetlb commitment tracking patch
> (hugetlb_tracking_R4). This now understands the difference between shm
> allocated and mmap allocated and handles them differently. This should
> fix 1. We now handle the commitments correctly under quota failures.

Ok. Here is R5, including all of the architectures hooked to the new
interface. Plus the spurious debug is gone.

-apw

---
arch/i386/mm/hugetlbpage.c | 28 +++++++++++------
arch/ia64/mm/hugetlbpage.c | 28 +++++++++++------
arch/ppc64/mm/hugetlbpage.c | 28 +++++++++++------
arch/sh/mm/hugetlbpage.c | 28 +++++++++++------
arch/sparc64/mm/hugetlbpage.c | 28 +++++++++++------
fs/hugetlbfs/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++--
fs/proc/proc_misc.c | 1
include/linux/hugetlb.h | 5 +++
8 files changed, 160 insertions(+), 52 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/i386/mm/hugetlbpage.c current/arch/i386/mm/hugetlbpage.c
--- reference/arch/i386/mm/hugetlbpage.c 2004-04-02 00:38:24.000000000 +0100
+++ current/arch/i386/mm/hugetlbpage.c 2004-04-01 22:58:48.000000000 +0100
@@ -334,6 +334,7 @@ int hugetlb_prefault(struct address_spac
struct mm_struct *mm = current->mm;
unsigned long addr;
int ret = 0;
+ struct page *page;

BUG_ON(vma->vm_start & ~HPAGE_MASK);
BUG_ON(vma->vm_end & ~HPAGE_MASK);
@@ -342,7 +343,6 @@ int hugetlb_prefault(struct address_spac
for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
unsigned long idx;
pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;

if (!pte) {
ret = -ENOMEM;
@@ -355,30 +355,38 @@ int hugetlb_prefault(struct address_spac
+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
page = find_get_page(mapping, idx);
if (!page) {
- /* charge the fs quota first */
+ /* charge against commitment */
+ ret = hugetlb_charge_page(vma);
+ if (ret)
+ goto out;
+ /* charge the fs quota */
if (hugetlb_get_quota(mapping)) {
ret = -ENOMEM;
- goto out;
+ goto undo_charge;
}
page = alloc_hugetlb_page();
if (!page) {
- hugetlb_put_quota(mapping);
ret = -ENOMEM;
- goto out;
+ goto undo_quota;
}
ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
unlock_page(page);
- if (ret) {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
+ if (ret)
+ goto undo_page;
}
set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
}
out:
spin_unlock(&mm->page_table_lock);
return ret;
+
+undo_page:
+ free_huge_page(page);
+undo_quota:
+ hugetlb_put_quota(mapping);
+undo_charge:
+ hugetlb_uncharge_page(vma);
+ goto out;
}

static void update_and_free_page(struct page *page)
diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/ia64/mm/hugetlbpage.c current/arch/ia64/mm/hugetlbpage.c
--- reference/arch/ia64/mm/hugetlbpage.c 2004-04-02 00:38:24.000000000 +0100
+++ current/arch/ia64/mm/hugetlbpage.c 2004-04-02 00:39:22.000000000 +0100
@@ -352,6 +352,7 @@ int hugetlb_prefault(struct address_spac
struct mm_struct *mm = current->mm;
unsigned long addr;
int ret = 0;
+ struct page *page;

BUG_ON(vma->vm_start & ~HPAGE_MASK);
BUG_ON(vma->vm_end & ~HPAGE_MASK);
@@ -360,7 +361,6 @@ int hugetlb_prefault(struct address_spac
for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
unsigned long idx;
pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;

if (!pte) {
ret = -ENOMEM;
@@ -373,30 +373,38 @@ int hugetlb_prefault(struct address_spac
+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
page = find_get_page(mapping, idx);
if (!page) {
- /* charge the fs quota first */
+ /* charge against commitment */
+ ret = hugetlb_charge_page(vma);
+ if (ret)
+ goto out;
+ /* charge the fs quota */
if (hugetlb_get_quota(mapping)) {
ret = -ENOMEM;
- goto out;
+ goto undo_charge;
}
page = alloc_hugetlb_page();
if (!page) {
- hugetlb_put_quota(mapping);
ret = -ENOMEM;
- goto out;
+ goto undo_quota;
}
ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
unlock_page(page);
- if (ret) {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
+ if (ret)
+ goto undo_page;
}
set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
}
out:
spin_unlock(&mm->page_table_lock);
return ret;
+
+undo_page:
+ free_huge_page(page);
+undo_quota:
+ hugetlb_put_quota(mapping);
+undo_charge:
+ hugetlb_uncharge_page(vma);
+ goto out;
}

unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/ppc64/mm/hugetlbpage.c current/arch/ppc64/mm/hugetlbpage.c
--- reference/arch/ppc64/mm/hugetlbpage.c 2004-04-02 00:38:24.000000000 +0100
+++ current/arch/ppc64/mm/hugetlbpage.c 2004-04-02 00:45:10.000000000 +0100
@@ -482,6 +482,7 @@ int hugetlb_prefault(struct address_spac
struct mm_struct *mm = current->mm;
unsigned long addr;
int ret = 0;
+ struct page *page;

WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON((vma->vm_start % HPAGE_SIZE) != 0);
@@ -491,7 +492,6 @@ int hugetlb_prefault(struct address_spac
for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
unsigned long idx;
hugepte_t *pte = hugepte_alloc(mm, addr);
- struct page *page;

BUG_ON(!in_hugepage_area(mm->context, addr));

@@ -506,30 +506,38 @@ int hugetlb_prefault(struct address_spac
+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
page = find_get_page(mapping, idx);
if (!page) {
- /* charge the fs quota first */
+ /* charge against commitment */
+ ret = hugetlb_charge_page(vma);
+ if (ret)
+ goto out;
+ /* charge the fs quota */
if (hugetlb_get_quota(mapping)) {
ret = -ENOMEM;
- goto out;
+ goto undo_charge;
}
page = alloc_hugetlb_page();
if (!page) {
- hugetlb_put_quota(mapping);
ret = -ENOMEM;
- goto out;
+ goto undo_quota;
}
ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
unlock_page(page);
- if (ret) {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
+ if (ret)
+ goto undo_page;
}
setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE);
}
out:
spin_unlock(&mm->page_table_lock);
return ret;
+
+undo_page:
+ free_huge_page(page);
+undo_quota:
+ hugetlb_put_quota(mapping);
+undo_charge:
+ hugetlb_uncharge_page(vma);
+ goto out;
}

/* Because we have an exclusive hugepage region which lies within the
diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/sh/mm/hugetlbpage.c current/arch/sh/mm/hugetlbpage.c
--- reference/arch/sh/mm/hugetlbpage.c 2004-04-02 00:36:59.000000000 +0100
+++ current/arch/sh/mm/hugetlbpage.c 2004-04-02 00:39:45.000000000 +0100
@@ -313,6 +313,7 @@ int hugetlb_prefault(struct address_spac
struct mm_struct *mm = current->mm;
unsigned long addr;
int ret = 0;
+ struct page *page;

BUG_ON(vma->vm_start & ~HPAGE_MASK);
BUG_ON(vma->vm_end & ~HPAGE_MASK);
@@ -321,7 +322,6 @@ int hugetlb_prefault(struct address_spac
for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
unsigned long idx;
pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;

if (!pte) {
ret = -ENOMEM;
@@ -334,30 +334,38 @@ int hugetlb_prefault(struct address_spac
+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
page = find_get_page(mapping, idx);
if (!page) {
- /* charge the fs quota first */
+ /* charge against commitment */
+ ret = hugetlb_charge_page(vma);
+ if (ret)
+ goto out;
+ /* charge the fs quota */
if (hugetlb_get_quota(mapping)) {
ret = -ENOMEM;
- goto out;
+ goto undo_charge;
}
page = alloc_hugetlb_page();
if (!page) {
- hugetlb_put_quota(mapping);
ret = -ENOMEM;
- goto out;
+ goto undo_quota;
}
ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
unlock_page(page);
- if (ret) {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
+ if (ret)
+ goto undo_page;
}
set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
}
out:
spin_unlock(&mm->page_table_lock);
return ret;
+
+undo_page:
+ free_huge_page(page);
+undo_quota:
+ hugetlb_put_quota(mapping);
+undo_charge:
+ hugetlb_uncharge_page(vma);
+ goto out;
}

static void update_and_free_page(struct page *page)
diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/sparc64/mm/hugetlbpage.c current/arch/sparc64/mm/hugetlbpage.c
--- reference/arch/sparc64/mm/hugetlbpage.c 2004-04-02 00:38:24.000000000 +0100
+++ current/arch/sparc64/mm/hugetlbpage.c 2004-04-02 00:39:56.000000000 +0100
@@ -309,6 +309,7 @@ int hugetlb_prefault(struct address_spac
struct mm_struct *mm = current->mm;
unsigned long addr;
int ret = 0;
+ struct page *page;

BUG_ON(vma->vm_start & ~HPAGE_MASK);
BUG_ON(vma->vm_end & ~HPAGE_MASK);
@@ -317,7 +318,6 @@ int hugetlb_prefault(struct address_spac
for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
unsigned long idx;
pte_t *pte = huge_pte_alloc(mm, addr);
- struct page *page;

if (!pte) {
ret = -ENOMEM;
@@ -330,30 +330,38 @@ int hugetlb_prefault(struct address_spac
+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
page = find_get_page(mapping, idx);
if (!page) {
- /* charge the fs quota first */
+ /* charge against commitment */
+ ret = hugetlb_charge_page(vma);
+ if (ret)
+ goto out;
+ /* charge the fs quota */
if (hugetlb_get_quota(mapping)) {
ret = -ENOMEM;
- goto out;
+ goto undo_charge;
}
page = alloc_hugetlb_page();
if (!page) {
- hugetlb_put_quota(mapping);
ret = -ENOMEM;
- goto out;
+ goto undo_quota;
}
ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
unlock_page(page);
- if (ret) {
- hugetlb_put_quota(mapping);
- free_huge_page(page);
- goto out;
- }
+ if (ret)
+ goto undo_page;
}
set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
}
out:
spin_unlock(&mm->page_table_lock);
return ret;
+
+undo_page:
+ free_huge_page(page);
+undo_quota:
+ hugetlb_put_quota(mapping);
+undo_charge:
+ hugetlb_uncharge_page(vma);
+ goto out;
}

static void update_and_free_page(struct page *page)
diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c 2004-03-25 02:43:00.000000000 +0000
+++ current/fs/hugetlbfs/inode.c 2004-04-01 23:07:02.000000000 +0100
@@ -32,6 +32,52 @@
/* some random number */
#define HUGETLBFS_MAGIC 0x958458f6

+#define HUGETLBFS_NOACCT (~0UL)
+
+atomic_t hugetlb_committed_space = ATOMIC_INIT(0);
+
+int hugetlb_acct_memory(long delta)
+{
+ atomic_add(delta, &hugetlb_committed_space);
+ if (delta > 0 && atomic_read(&hugetlb_committed_space) >
+ hugetlb_total_pages()) {
+ atomic_add(-delta, &hugetlb_committed_space);
+ return -ENOMEM;
+ }
+ return 0;
+}
+int hugetlb_charge_page(struct vm_area_struct *vma)
+{
+ int ret;
+
+ /* if this file is marked for commit on demand then see if we can
+ * commit a page, if so account for it against this file. */
+ if (vma->vm_file->f_dentry->d_inode->i_blocks != ~0) {
+ ret = hugetlb_acct_memory(HPAGE_SIZE / PAGE_SIZE);
+ if (ret)
+ return ret;
+ vma->vm_file->f_dentry->d_inode->i_blocks++;
+ }
+ return 0;
+}
+int hugetlb_uncharge_page(struct vm_area_struct *vma)
+{
+ /* if this file is marked for commit on demand return a page. */
+ if (vma->vm_file->f_dentry->d_inode->i_blocks != ~0) {
+ hugetlb_acct_memory(-(HPAGE_SIZE / PAGE_SIZE));
+ vma->vm_file->f_dentry->d_inode->i_blocks--;
+ }
+ return 0;
+}
+
+int hugetlbfs_report_meminfo(char *buf)
+{
+#define K(x) ((x) << (PAGE_SHIFT - 10))
+ long htlb = atomic_read(&hugetlb_committed_space);
+ return sprintf(buf, "HugeCommitted_AS: %5lu kB\n", K(htlb));
+#undef K
+}
+
static struct super_operations hugetlbfs_ops;
static struct address_space_operations hugetlbfs_aops;
struct file_operations hugetlbfs_file_operations;
@@ -200,6 +246,11 @@ static void hugetlbfs_delete_inode(struc

if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);
+ if (inode->i_blocks != HUGETLBFS_NOACCT)
+ hugetlb_acct_memory(-(inode->i_blocks *
+ (HPAGE_SIZE / PAGE_SIZE)));
+ else
+ hugetlb_acct_memory(-(inode->i_size / PAGE_SIZE));

security_inode_delete(inode);

@@ -241,6 +292,11 @@ out_truncate:
spin_unlock(&inode_lock);
if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);
+ if (inode->i_blocks != HUGETLBFS_NOACCT)
+ hugetlb_acct_memory(-(inode->i_blocks *
+ (HPAGE_SIZE / PAGE_SIZE)));
+ else
+ hugetlb_acct_memory(-(inode->i_size / PAGE_SIZE));

if (sbinfo->free_inodes >= 0) {
spin_lock(&sbinfo->stat_lock);
@@ -350,6 +406,10 @@ static int hugetlbfs_setattr(struct dent
error = hugetlb_vmtruncate(inode, attr->ia_size);
if (error)
goto out;
+ /* We rely on the fact that the sizes are hugepage aligned,
+ * and that hugetlb_vmtruncate prevents extending the file. */
+ hugetlb_acct_memory((attr->ia_size - i_size_read(inode)) /
+ PAGE_SIZE);
attr->ia_valid &= ~ATTR_SIZE;
}
error = inode_setattr(inode, attr);
@@ -710,8 +770,9 @@ struct file *hugetlb_zero_setup(size_t s
if (!capable(CAP_IPC_LOCK))
return ERR_PTR(-EPERM);

- if (!is_hugepage_mem_enough(size))
- return ERR_PTR(-ENOMEM);
+ error = hugetlb_acct_memory(size / PAGE_SIZE);
+ if (error)
+ return ERR_PTR(error);

root = hugetlbfs_vfsmount->mnt_root;
snprintf(buf, 16, "%lu", hugetlbfs_counter());
@@ -736,6 +797,7 @@ struct file *hugetlb_zero_setup(size_t s
d_instantiate(dentry, inode);
inode->i_size = size;
inode->i_nlink = 0;
+ inode->i_blocks = HUGETLBFS_NOACCT;
file->f_vfsmnt = mntget(hugetlbfs_vfsmount);
file->f_dentry = dentry;
file->f_mapping = inode->i_mapping;
diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/proc/proc_misc.c current/fs/proc/proc_misc.c
--- reference/fs/proc/proc_misc.c 2004-04-02 00:37:04.000000000 +0100
+++ current/fs/proc/proc_misc.c 2004-04-01 22:51:19.000000000 +0100
@@ -232,6 +232,7 @@ static int meminfo_read_proc(char *page,
);

len += hugetlb_report_meminfo(page + len);
+ len += hugetlbfs_report_meminfo(page + len);

return proc_calc_metrics(page, start, off, count, eof, len);
#undef K
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h
--- reference/include/linux/hugetlb.h 2004-04-02 00:38:24.000000000 +0100
+++ current/include/linux/hugetlb.h 2004-04-01 22:51:19.000000000 +0100
@@ -115,11 +115,16 @@ static inline void set_file_hugepages(st
{
file->f_op = &hugetlbfs_file_operations;
}
+int hugetlbfs_report_meminfo(char *);
+int hugetlb_charge_page(struct vm_area_struct *vma);
+int hugetlb_uncharge_page(struct vm_area_struct *vma);
+
#else /* !CONFIG_HUGETLBFS */

#define is_file_hugepages(file) 0
#define set_file_hugepages(file) BUG()
#define hugetlb_zero_setup(size) ERR_PTR(-ENOSYS)
+#define hugetlbfs_report_meminfo(buf) 0

#endif /* !CONFIG_HUGETLBFS */


2004-04-01 23:11:36

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [PATCH] [0/6] HUGETLB memory commitment

>>>>> Andy Whitcroft wrote on Thu, April 01, 2004 1:16 PM
> --On 31 March 2004 00:51 -0800 "Chen, Kenneth W" <[email protected]> wrote:
>
> > Under common case, worked perfectly! But there are always corner cases.
> >
> > I can think of two ugliness:
> > 1. very sparse hugetlb file. I can mmap one hugetlb page, at offset
> > 512 GB. This would account 512GB + 1 hugetlb page as committed_AS.
> > But I only asked for one page mapping. One can say it's a feature,
> > but I think it's a bug.
> >
> > 2. There is no error checking (to undo the committed_AS accounting) after
> > hugetlb_prefault(). hugetlb_prefault doesn't always succeed in allocat-
> > ing all the pages user asked for due to disk quota limit. It can have
> > partial allocation which would put the committed_AS in a wedged state.
>
> O.k. Here is the latest version of the hugetlb commitment tracking patch
> (hugetlb_tracking_R4). This now understands the difference between shm
> allocated and mmap allocated and handles them differently. This should
> fix 1.
>
> diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/i386/mm/hugetlbpage.c current/arch/i386/mm/hugetlbpage.c
> --- reference/arch/i386/mm/hugetlbpage.c 2004-04-01 13:37:14.000000000 +0100
> +++ current/arch/i386/mm/hugetlbpage.c 2004-04-01 21:54:54.000000000 +0100
> @@ -355,30 +357,38 @@ int hugetlb_prefault(struct address_spac
> + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
> page = find_get_page(mapping, idx);
> if (!page) {
> - /* charge the fs quota first */
> + /* charge against commitment */
> + ret = hugetlb_charge_page(vma);
> + if (ret)
> + goto out;
> + /* charge the fs quota */
> if (hugetlb_get_quota(mapping)) {
> ret = -ENOMEM;
> - goto out;
> + goto undo_charge;
> }
> page = alloc_hugetlb_page();


committed_AS accounting is done at fault time? Doesn't that defeat the purpose
of overcommit checking at mmap time for on-demand paging?

I thought someone mentioned it since day one of this discussion: strict over-
commit is near impossible with the current infrastructure in a multi-thread,
multi-process environment. I can have a random number of processes mmap a
random number of ranges and randomly commit each page in the future. There is
just no structure out there to keep track of what will be mapped, and no
robust way to find what has been mapped and how much will be needed at mmap
time.

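(For illustration -- a hedged user-space sketch; the mount point is made
up and a 2MB huge page size is assumed. The kernel sees both mmap()s
below, but, as noted later in this thread, the munmap() shrinking the
first mapping gives the filesystem no notification, so mmap-time
accounting cannot know what is still outstanding:)

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE (2UL * 1024 * 1024)	/* assumed huge page size */

int main(void)
{
	char *a, *b;
	int fd = open("/mnt/huge/file", O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return 1;
	/* map four huge pages, then give two back */
	a = mmap(NULL, 4 * HPAGE, PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);
	if (a == MAP_FAILED)
		return 1;
	munmap(a + 2 * HPAGE, 2 * HPAGE);
	/* another process (or this one) maps the same range again:
	 * how many pages should now be counted as committed? */
	b = mmap(NULL, 4 * HPAGE, PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);
	return b == MAP_FAILED;
}
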
Can we just RIP this whole hugetlb page overcommit?

- Ken


2004-04-03 03:51:58

by Ray Bryant

[permalink] [raw]
Subject: Re: [PATCH] HUGETLB memory commitment

Chen, Kenneth W wrote:

>
> Can we just RIP this whole hugetlb page overcommit?
>
> - Ken
>
>
>

Ken et al,

Perhaps the following patch might be more to your liking. I'm sorry I haven't
been contributing to this discussion -- I've been off doing this code first
for Altix under 2.4.21 (one's got to eat, after all). Now I've ported the
changes forward to Linux 2.6.5-rc3 and tested them. The patch below is
relative to that version of Linux.

A few points to be made about this patch:

(1) This patch includes "allocate on fault" and "hugetlb memory commit"
changes. One can argue that this is mixing two changes into a single patch,
but the two changes seem intertwined to me -- one doesn't make sense without
the other.

(2) I've only done the ia64 version. I've not yet tackled Andrew's
suggestion that we move the common parts of the arch dependent hugetlbpage.c
up into ./mm. So, since hugetlbfs_file_mmap() in this patch no longer calls
hugetlb_prefault(), this patch will break hugetlbpage support on architectures
other than ia64 until those architectures are fixed or we move the common code
up into the machine dependent mm directory. If we can get agreement on this
patch then those that understand the arch components can help get the common
code defined and we can move that common code up to ./mm.

(3) This code uses a simple implementation of the hugetlb memory commit
stuff. It only accounts for hugetlb pages since creation of hugetlbpages
effectively moves those pages out of the common memory pool anyway. It
satisfies the requirement that an mmap() for hugetlbpages get an -ENOMEM if
there are not enough hugetlb pages to (eventually) satisfy the request as
hugetlb pages are faulted in, as per Andrew's suggestion that no interface
changes be made due to allocation of hugetlbpages at fault time instead of at
hugetlb_prefault() time.

The hugetlb memory commit code does this with a single global counter:
htlbzone_reserved, and a per inode reserved page count. The latter is used to
decrement the global reserved page count when the inode is deleted or the file
is truncated.

The code does not change the SysV paths in the kernel. Instead it implements
the reservation scheme in hugetlbfs_file_mmap() which is common code for
mmap() and shmget().

This is the reason that a separate reserved page counter is needed in the
inode. (One might think that one could encode the reserved page count in the
size field of the inode, but I was unable to get that to work because the SysV
shared memory code sets inode->i_size before calling hugetlbfs_file_mmap().
So when we get there from the SysV code we are unable to recognize whether or
not this is the first time we have seen this inode, and thus need to reserve
pages, or this is a subsequent remapping of the inode that uses a previously
established reservation. Using a separate field in the inode for this solves
that problem.)

Suggestions, flames, criticisms or (gasp) even praise gladly accepted by the
undersigned. :-)

============================================================================
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1767 -> 1.1769
# include/linux/fs.h 1.295 -> 1.296
# mm/memory.c 1.154 -> 1.155
# include/linux/hugetlb.h 1.23 -> 1.24
# arch/ia64/mm/hugetlbpage.c 1.19 -> 1.21
# fs/hugetlbfs/inode.c 1.40 -> 1.41
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 04/04/02 [email protected] 1.1768
# memory.c:
# Change handle_mm_fault() to call hugetlb_do_no_page() if the fault
# is for a hugetlb vma. Change call to follow_hugetlb_page() to match the new
# definition of same in hugetlbpage.c
# hugetlb.h:
# Header changes related to hugetlbpage.c allocate on fault changes.
# fs.h:
# Add union member "data" to union "u" in inode struct. This overlays
# void *generic_ip. Used to hold reservation of hugetlbpages assigned
# to this inode.
# inode.c:
# Rewrite hugetlbfs_file_mmap() to eliminate hugetlb_prefault() and to
# handle reservation of hugetlbpages via hugetlb_reserve()/hugetlb_unreserve()
# hugetlbpage.c:
# Eliminate hugetlb_prefault(), replace with hugetlb_do_no_page.
# Hugetlb pages now allocated at page fault time rather than mmap() time.
# Move zeroing of hugetlbpage out of alloc_hugetlb_page().
# Introduce htlbzone_reserved, hugetlb_reserve(), hugetlb_unreserve() to manage
# reservation of hugetlbpages, so we can return -ENOMEM if there will not be
# enough pages to (eventually) be allocated to satisfy this request.
# --------------------------------------------
# 04/04/02 [email protected] 1.1769
# hugetlbpage.c:
# Put check into decrease nr_hugepages loop in set_hugetlb_mem_size()
# to make sure we don't reduce the number of hugetlbpages below number
# of reserved hugetlbpages.
# --------------------------------------------
#
diff -Nru a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
--- a/arch/ia64/mm/hugetlbpage.c Fri Apr 2 19:31:56 2004
+++ b/arch/ia64/mm/hugetlbpage.c Fri Apr 2 19:31:56 2004
@@ -26,6 +26,7 @@
int htlbpage_max;
static long htlbzone_pages;
unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;
+static long htlbzone_reserved;

static struct list_head hugepage_freelists[MAX_NUMNODES];
static spinlock_t htlbpage_lock = SPIN_LOCK_UNLOCKED;
@@ -65,9 +66,17 @@

void free_huge_page(struct page *page);

-static struct page *alloc_hugetlb_page(void)
+static inline void zero_hugetlb_page(struct page *page)
{
int i;
+
+ for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i) {
+ clear_highpage(&page[i]);
+ }
+}
+
+static struct page *alloc_hugetlb_page(void)
+{
struct page *page;

spin_lock(&htlbpage_lock);
@@ -80,8 +89,6 @@
spin_unlock(&htlbpage_lock);
set_page_count(page, 1);
page->lru.prev = (void *)free_huge_page;
- for (i = 0; i < (HPAGE_SIZE/PAGE_SIZE); ++i)
- clear_highpage(&page[i]);
return page;
}

@@ -153,20 +160,22 @@
{
pte_t *src_pte, *dst_pte, entry;
struct page *ptepage;
- unsigned long addr = vma->vm_start;
- unsigned long end = vma->vm_end;
+ unsigned long addr;

- while (addr < end) {
+ for (addr=vma->vm_start; addr<vma->vm_end; addr += HPAGE_SIZE) {
dst_pte = huge_pte_alloc(dst, addr);
if (!dst_pte)
goto nomem;
src_pte = huge_pte_offset(src, addr);
+ if (!src_pte)
+ continue;
entry = *src_pte;
- ptepage = pte_page(entry);
- get_page(ptepage);
+ if (pte_present(entry)) {
+ ptepage = pte_page(entry);
+ get_page(ptepage);
+ dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+ }
set_pte(dst_pte, entry);
- dst->rss += (HPAGE_SIZE / PAGE_SIZE);
- addr += HPAGE_SIZE;
}
return 0;
nomem:
@@ -174,9 +183,9 @@
}

int
-follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
+follow_hugetlb_page(struct task_struct *tsk, struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
- unsigned long *st, int *length, int i)
+ unsigned long *st, int *length, int i, int write)
{
pte_t *ptep, pte;
unsigned long start = *st;
@@ -187,6 +196,18 @@
do {
pstart = start & HPAGE_MASK;
ptep = huge_pte_offset(mm, start);
+
+ /*
+ * the page was reserved, we should get it with a minor fault
+ * since hugetlb pages are never swapped out
+ */
+ if (!ptep || !pte_present(*ptep)) {
+ if (handle_mm_fault(mm, vma, start, write) != 1)
+ BUG();
+ tsk->min_flt++;
+ ptep = huge_pte_offset(mm, start);
+ }
+
pte = *ptep;

back1:
@@ -228,6 +249,12 @@
pte_t *ptep;

ptep = huge_pte_offset(mm, addr);
+ if (!ptep || !pte_present(*ptep)) {
+ if (handle_mm_fault(mm, vma, addr, write) != 1)
+ BUG();
+ current->min_flt++;
+ ptep = huge_pte_offset(mm, addr);
+ }
page = pte_page(*ptep);
page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
get_page(page);
@@ -347,6 +374,7 @@
spin_unlock(&mm->page_table_lock);
}

+#ifdef NOTDEF
int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
{
struct mm_struct *mm = current->mm;
@@ -398,6 +426,32 @@
spin_unlock(&mm->page_table_lock);
return ret;
}
+#endif
+
+int hugetlb_reserve(int nr_hugepages)
+{
+ int rc;
+
+ spin_lock(&htlbpage_lock);
+ if ((htlbzone_reserved + nr_hugepages) <= htlbzone_pages) {
+ htlbzone_reserved += nr_hugepages;
+ rc = nr_hugepages;
+ goto out_unlock;
+ }
+ rc = -ENOMEM;
+
+out_unlock:
+ spin_unlock(&htlbpage_lock);
+ return rc;
+}
+
+void hugetlb_unreserve(int nr_hugepages)
+{
+ spin_lock(&htlbpage_lock);
+ htlbzone_reserved -= nr_hugepages;
+ spin_unlock(&htlbpage_lock);
+ return;
+}

unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
@@ -422,6 +476,8 @@
addr = ALIGN(vmm->vm_end, HPAGE_SIZE);
}
}
+
+/* caller must hold htlbpage_lock */
void update_and_free_page(struct page *page)
{
int j;
@@ -498,6 +554,8 @@
/* Shrink the memory size. */
lcount = try_to_free_low(lcount);
while (lcount++) {
+ if (htlbzone_pages <= htlbzone_reserved)
+ break;
page = alloc_hugetlb_page();
if (page == NULL)
break;
@@ -541,6 +599,7 @@
printk(KERN_WARNING "Invalid huge page size specified\n");
return 1;
}
+ htlbzone_reserved = 0;

hpage_shift = __ffs(size);
/*
@@ -577,12 +636,14 @@
int hugetlb_report_meminfo(char *buf)
{
return sprintf(buf,
- "HugePages_Total: %5lu\n"
- "HugePages_Free: %5lu\n"
- "Hugepagesize: %5lu kB\n",
+ "HugePages_Total: %5lu\n"
+ "HugePages_Free: %5lu\n"
+ "Hugepagesize: %5lu kB\n"
+ "HugePages_Reserved: %5lu\n",
htlbzone_pages,
htlbpagemem,
- HPAGE_SIZE/1024);
+ HPAGE_SIZE/1024,
+ htlbzone_reserved);
}

int is_hugepage_mem_enough(size_t size)
@@ -602,6 +663,72 @@
{
BUG();
return NULL;
+}
+
+/*
+ * enter with mm->mmap_sem held in read mode and mm->page_table_lock held
+ * drops mm->page_table_lock before returning
+ */
+int hugetlb_do_no_page(struct mm_struct * mm, struct vm_area_struct * vma,
+ unsigned long address, int write_access)
+{
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
+ pte_t *pte;
+ unsigned long idx;
+ struct page *page;
+ int rc;
+
+ address &= ~(HPAGE_SIZE-1);
+
+ pte = huge_pte_alloc(mm, address);
+
+ if (!pte) {
+ rc = 0;
+ goto unlock_and_return;
+ }
+
+ /*
+ * we don't drop the page_table_lock before here, so if we find the
+ * pte valid now, then a previous lock holder handled the fault
+ */
+ if (!pte_none(*pte)) {
+ rc = 1;
+ goto unlock_and_return;
+ }
+
+ mm->rss += HPAGE_SIZE / PAGE_SIZE;
+
+ idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+ + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+ page = find_get_page(mapping, idx);
+
+ if (!page) {
+
+ page = alloc_hugetlb_page();
+
+ /* we reserved the page in hugetlbfs_file_mmap() */
+ BUG_ON(!page);
+
+ add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+ unlock_page(page);
+
+ set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+ mark_page_accessed(page);
+ update_mmu_cache(vma, address, *pte);
+ spin_unlock(&mm->page_table_lock);
+ zero_hugetlb_page(page);
+ return 1;
+
+ }
+ mark_page_accessed(page);
+ update_mmu_cache(vma, address, *pte);
+ set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
+ rc = 1;
+
+unlock_and_return:
+ spin_unlock(&mm->page_table_lock);
+ return rc;
}

struct vm_operations_struct hugetlb_vm_ops = {
diff -Nru a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c Fri Apr 2 19:31:56 2004
+++ b/fs/hugetlbfs/inode.c Fri Apr 2 19:31:56 2004
@@ -47,8 +47,7 @@
{
struct inode *inode = file->f_dentry->d_inode;
struct address_space *mapping = inode->i_mapping;
- loff_t len, vma_len;
- int ret;
+ unsigned long reserved_pages, prev_reserved_pages, new_reservation;

if (vma->vm_start & ~HPAGE_MASK)
return -EINVAL;
@@ -59,19 +58,34 @@
if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
return -EINVAL;

- vma_len = (loff_t)(vma->vm_end - vma->vm_start);
+ reserved_pages = (vma->vm_end - vma->vm_start) >> HPAGE_SHIFT;

down(&inode->i_sem);
file_accessed(file);
+
+ /* a second mmap() (or a rmap()) can change the reservation */
+ prev_reserved_pages = inode->u.data;
+
+ /*
+ * if current mmap() is smaller than previous reservation,
+ * we don't change reservation or quota
+ */
+ if (reserved_pages >= prev_reserved_pages) {
+ new_reservation = reserved_pages - prev_reserved_pages;
+ if ((hugetlb_get_quota(mapping, new_reservation) < 0) ||
+ (hugetlb_reserve(new_reservation) < 0)) {
+ up(&inode->i_sem);
+ return -ENOMEM;
+ }
+ inode->i_size = reserved_pages << HPAGE_SHIFT;
+ inode->u.data = reserved_pages;
+ }
+ up(&inode->i_sem);
+
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
- ret = hugetlb_prefault(mapping, vma);
- len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
- if (ret == 0 && inode->i_size < len)
- inode->i_size = len;
- up(&inode->i_sem);

- return ret;
+ return 0;
}

/*
@@ -158,6 +172,7 @@
void truncate_hugepages(struct address_space *mapping, loff_t lstart)
{
const pgoff_t start = lstart >> HPAGE_SHIFT;
+ struct inode *inode = mapping->host;
struct pagevec pvec;
pgoff_t next;
int i;
@@ -181,10 +196,11 @@
++next;
truncate_huge_page(page);
unlock_page(page);
- hugetlb_put_quota(mapping);
}
huge_pagevec_release(&pvec);
}
+ hugetlb_put_quota(mapping, inode->u.data-start);
+ hugetlb_unreserve(inode->u.data-start);
BUG_ON(!lstart && mapping->nrpages);
}

@@ -198,8 +214,11 @@
inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);

- if (inode->i_data.nrpages)
- truncate_hugepages(&inode->i_data, 0);
+ /*
+ * we need to call this even if inode->i_data.nrpages == 0 to
+ * update the file system quota and hugetlb page reservation
+ */
+ truncate_hugepages(&inode->i_data, 0);

security_inode_delete(inode);

@@ -239,8 +258,11 @@
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
- if (inode->i_data.nrpages)
- truncate_hugepages(&inode->i_data, 0);
+ /*
+ * we need to call this even if inode->i_data.nrpages == 0 to
+ * update the file system quota and hugetlb page reservation
+ */
+ truncate_hugepages(&inode->i_data, 0);

if (sbinfo->free_inodes >= 0) {
spin_lock(&sbinfo->stat_lock);
@@ -383,6 +405,7 @@
inode->i_mapping->a_ops = &hugetlbfs_aops;
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+ inode->u.data = 0;
switch (mode & S_IFMT) {
default:
init_special_inode(inode, mode, dev);
@@ -641,15 +664,15 @@
return -ENOMEM;
}

-int hugetlb_get_quota(struct address_space *mapping)
+int hugetlb_get_quota(struct address_space *mapping, unsigned long nr_pages)
{
int ret = 0;
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

if (sbinfo->free_blocks > -1) {
spin_lock(&sbinfo->stat_lock);
- if (sbinfo->free_blocks > 0)
- sbinfo->free_blocks--;
+ if (sbinfo->free_blocks >= nr_pages)
+ sbinfo->free_blocks -= nr_pages;
else
ret = -ENOMEM;
spin_unlock(&sbinfo->stat_lock);
@@ -658,13 +681,13 @@
return ret;
}

-void hugetlb_put_quota(struct address_space *mapping)
+void hugetlb_put_quota(struct address_space *mapping, unsigned long nr_pages)
{
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

if (sbinfo->free_blocks > -1) {
spin_lock(&sbinfo->stat_lock);
- sbinfo->free_blocks++;
+ sbinfo->free_blocks += nr_pages;
spin_unlock(&sbinfo->stat_lock);
}
}
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h Fri Apr 2 19:31:56 2004
+++ b/include/linux/fs.h Fri Apr 2 19:31:56 2004
@@ -425,6 +425,7 @@
__u32 i_generation;
union {
void *generic_ip;
+ long data;
} u;
#ifdef __NEED_I_SIZE_ORDERED
seqcount_t i_size_seqcount;
diff -Nru a/include/linux/hugetlb.h b/include/linux/hugetlb.h
--- a/include/linux/hugetlb.h Fri Apr 2 19:31:56 2004
+++ b/include/linux/hugetlb.h Fri Apr 2 19:31:56 2004
@@ -12,10 +12,12 @@

int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void *, size_t *);
int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
-int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
+int follow_hugetlb_page(struct task_struct *, struct mm_struct *, struct vm_area_struct *,
+ struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
void zap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
-int hugetlb_prefault(struct address_space *, struct vm_area_struct *);
+int hugetlb_reserve(int);
+void hugetlb_unreserve(int);
void huge_page_release(struct page *);
int hugetlb_report_meminfo(char *);
int is_hugepage_mem_enough(size_t);
@@ -111,8 +113,10 @@
extern struct file_operations hugetlbfs_file_operations;
extern struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_zero_setup(size_t);
-int hugetlb_get_quota(struct address_space *mapping);
-void hugetlb_put_quota(struct address_space *mapping);
+int hugetlb_get_quota(struct address_space *mapping, unsigned long nr_pages);
+void hugetlb_put_quota(struct address_space *mapping, unsigned long nr_pages);
+extern int hugetlb_do_no_page(struct mm_struct * mm, struct vm_area_struct * vma,
+ unsigned long address, int write_access);

static inline int is_file_hugepages(struct file *file)
{
diff -Nru a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c Fri Apr 2 19:31:56 2004
+++ b/mm/memory.c Fri Apr 2 19:31:56 2004
@@ -741,8 +741,8 @@
return i ? : -EFAULT;

if (is_vm_hugetlb_page(vma)) {
- i = follow_hugetlb_page(mm, vma, pages, vmas,
- &start, &len, i);
+ i = follow_hugetlb_page(tsk, mm, vma, pages, vmas,
+ &start, &len, i, write);
continue;
}
spin_lock(&mm->page_table_lock);
@@ -1619,7 +1619,7 @@
inc_page_state(pgfault);

if (is_vm_hugetlb_page(vma))
- return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+ return hugetlb_do_no_page(mm, vma, address, write_access);

/*
* We need the page table lock to synchronize with kswapd

============================================================================


--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------


2004-04-04 03:32:49

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [PATCH] HUGETLB memory commitment

>>>>> Ray Bryant wrote on Fri, April 02, 2004 7:57 PM
> Chen, Kenneth W wrote:
> >
> > Can we just RIP this whole hugetlb page overcommit?
> >
>
> Ken et al,
>
> Perhaps the following patch might be more to your liking. I'm
> sorry I haven't been contributing to this discussion -- I've been
> off doing this code first for Altix under 2.4.21 (one's got to eat,
> after all). Now I've ported the changes forward to Linux 2.6.5-rc3
> and tested them. The patch below is relative to that version of Linux.

Somehow the patch came through with extra white space at the beginning of
each line, but s/^ / / fixes that up.


> The hugetlb memory commit code does this with a single global counter:
> htlbzone_reserved, and a per inode reserved page count. The latter is
> used to decrement the global reserved page count when the inode is
> deleted or the file is truncated.

A simple counter won't work for mappings at different file offsets. It has to
be some sort of per-inode, per-block reservation tracking. I think we are
steering in the right direction though.


> diff -Nru a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> --- a/fs/hugetlbfs/inode.c Fri Apr 2 19:31:56 2004
> +++ b/fs/hugetlbfs/inode.c Fri Apr 2 19:31:56 2004
> @@ -59,19 +58,34 @@
> if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
> return -EINVAL;
>
> - vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> + reserved_pages = (vma->vm_end - vma->vm_start) >> HPAGE_SHIFT;
>
> down(&inode->i_sem);
> file_accessed(file);
> +
> + /* a second mmap() (or a rmap()) can change the reservation */
> + prev_reserved_pages = inode->u.data;
> +
> + /*
> + * if current mmap() is smaller than previous reservation,
> + * we don't change reservation or quota
> + */
> + if (reserved_pages >= prev_reserved_pages) {
> + new_reservation = reserved_pages - prev_reserved_pages;
> + if ((hugetlb_get_quota(mapping, new_reservation) < 0) ||
> + (hugetlb_reserve(new_reservation) < 0)) {
> + up(&inode->i_sem);
> + return -ENOMEM;
> + }
> + inode->i_size = reserved_pages << HPAGE_SHIFT;
> + inode->u.data = reserved_pages;
> + }
> + up(&inode->i_sem);
> +

This assumes all mmaps start from the same file offset. IMO, it's not
generic enough. This code will only reserve 1 page for the following
case (sketched below), but actually there are 4 mappings totaling 4 pages:

mmap 1 page at file offset 0
mmap 1 page at file offset HPAGE_SIZE,
mmap 1 page at file offset HPAGE_SIZE*2,
mmap 1 page at file offset HPAGE_SIZE*3,

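(A minimal user-space sketch of that case -- the mount point is made up
and 2MB huge pages are assumed; this is an illustration, not test code
posted in this thread:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define HPAGE (2UL * 1024 * 1024)	/* assumed huge page size */

int main(void)
{
	int i, fd = open("/mnt/huge/testfile", O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return 1;
	/* four 1-page mappings at consecutive huge page offsets;
	 * each covers a distinct page, so a correct scheme must
	 * reserve four pages, not one */
	for (i = 0; i < 4; i++) {
		void *p = mmap(NULL, HPAGE, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, (off_t)i * HPAGE);
		if (p == MAP_FAILED)
			perror("mmap");
	}
	return 0;
}
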
Oh, this code broke file system quota accounting as well.

- Ken


2004-04-04 22:10:51

by Ray Bryant

[permalink] [raw]
Subject: Re: [PATCH] HUGETLB memory commitment

Ken,

If you have user space code that tests this, please send it to me and I'll
use it to fix up the reservation and quota code to handle this case as well.

Thanks,

Chen, Kenneth W wrote:
>>>
>
>
> This assumes all mmap start from the same file offset. IMO, it's not
> generic enough. This code will only reserve 1 page for the following
> case, but actually there are 4 mapping totaling 4 pages:
>
> mmap 1 page at file offset 0
> mmap 1 page at file offset HPAGE_SIZE,
> mmap 1 page at file offset HPAGE_SIZE*2,
> mmap 1 page at file offset HPAGE_SIZE*3,
>
> Oh, this code broke file system quota accounting as well.
>
> - Ken
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-04-05 16:20:43

by Ray Bryant

[permalink] [raw]
Subject: Re: [Lse-tech] RE: [PATCH] HUGETLB memory commitment

Ken,

Chen, Kenneth W wrote:

>
> A simple counter won't work for mappings at different file offsets. It has to
> be some sort of per-inode, per-block reservation tracking. I think we are
> steering in the right direction though.
>
>
>

OK, pardon my question about test code, that is trivial enough I guess.

Anyway, the only way I can see to make this work with non-zero offset is to
hang a list of segment descriptors (offset and size) for each reserved segment
off of the inode. Then when a new mapping comes in, we search the segment
list to see if the new offset and size overlaps with any of the existing
reserved segments. If it doesn't, then we make a new reservation (and request
file system quota) for the current size, and add the current request to the
reserved segment list. If it does, and it fits entirely in a previously
reserved segment, then no change to reservation/quota needs to be made. If
it only partially fits, then we need to make a new reservation/quota request
for the number of new huge pages required and update the overlapping segment's
length to reflect the new reservation.

Then in truncate_hugepages() we can search the segment list again, discarding
full or partial segments that occur either entirely or partially beyond
"lstart", as appropriate, and doing hugetlb_unreserve() and
hugetlb_put_quota() for the appropriate number of pages.

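(A rough sketch of the overlap search described above -- the structure
and names are invented for illustration and are not from any posted
patch; only the single-overlapping-segment case described is handled,
with units in huge pages:)

#include <linux/list.h>

struct hseg {				/* hypothetical segment node */
	struct list_head list;
	unsigned long start;
	unsigned long end;
};

/*
 * How many not-yet-reserved huge pages does a new mapping of
 * [start, end) add? A range overlapping several existing
 * segments would additionally need a merge pass.
 */
static unsigned long hseg_new_pages(struct list_head *head,
				    unsigned long start, unsigned long end)
{
	struct hseg *s;

	list_for_each_entry(s, head, list) {
		if (end <= s->start || start >= s->end)
			continue;		/* disjoint segment */
		if (start >= s->start && end <= s->end)
			return 0;		/* fits entirely */
		/* partial fit: only the uncovered head/tail is new */
		return (s->start > start ? s->start - start : 0) +
		       (end > s->end ? end - s->end : 0);
	}
	return end - start;			/* no overlap: all new */
}

Truncate would walk the same list, dropping or trimming any segment
that extends beyond lstart before unreserving the freed pages.
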
This will be quite a bit of code and complexity. Do we still think this is
all worth it to follow Andrew's suggestion of no API changes for "allocate on
fault" hugetlbpages? It would be a lot cleaner just to return SIGBUS if we
run out of hugepages and be done with it, in spite of the API change.

Is there a simpler way to do the correct reservation? (One could allocate the
pages at mmap() time, resurrecting hugetlb_prefault(), but zero the pages at
fault time; this would solve the original problem we ran into at SGI, but
would not solve Andi's requirement to postpone allocation so NUMA API's can
control placement.)

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-04-05 17:02:19

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [Lse-tech] RE: [PATCH] HUGETLB memory commitment

>>>> Ray Bryant wrote on Mon, April 05, 2004 8:27 AM
> Chen, Kenneth W wrote:
>
> >
> > A simple counter won't work for different file offset mapping. It has to
> > be some sort of per-inode, per-block reservation tracking. I think we are
> > steering in the right direction though.
> >
> >
>
> OK, pardon my question about test code, that is trivial enough I guess.
>
> Anyway, the only way I can see to make this work with non-zero offset is to
> hang a list of segment descriptors (offset and size) for each reserved segment
> off of the inode. Then when a new mapping comes in, we search the segment
> list to see if the new offset and size overlaps with any of the existing
> reserved segments. If it doesn't, then we make a new reservation (and request
> file system quota) for the current size, and add the current request to the
> reserved segment list. If it does, and it fits entirely in a previously
> reserved segement, then no change to reservation/quota needs to be made. If
> it only partially fits, then we need to make a new reservation/quota request
> for the number of new huge pages required and update the overlapping segment's
> length to reflect the new reservation.
>
> Then in truncate_hugepages() we can search the segment list again, discarding
> full or partial segments that occur either entirely or partially beyond
> "lstart", as appropriate, and doing hugetlb_unreserve() and
> hugetlb_put_quota() for the appropriate number of pages.
>
> This will be quite a bit of code and complexity. Do we still think this is
> all worth it to follow Andrew's suggestion of no API changes for "allocate on
> fault" hugetlbpages? It would be a lot cleaner just to return SIGBUS if we
> run out of hugepages and be done with it, in spite of the API change.
>
> Is there a simpler way to do the correct reservation? (One could allocate the
> pages at mmap() time, resurrecting hugetlb_prefault(), but zero the pages at
> fault time; this would solve the original problem we ran into at SGI, but
> would not solve Andi's requirement to postpone allocation so NUMA API's can
> control placement.)

I actually started coding yesterday. It doesn't look too bad (I think). I will
post it once I finish it up later today or tomorrow.

There are still some oddities in the lifetime of the huge page reservation,
but that can be discussed once everyone sees the code.

- Ken


2004-04-05 19:16:42

by Ray Bryant

[permalink] [raw]
Subject: Re: [Lse-tech] RE: [PATCH] HUGETLB memory commitment



Chen, Kenneth W wrote:

>
>
> I actually started coding yesterday. It doesn't look too bad (I think). I will
> post it once I finish it up later today or tomorrow.
>

Hmmm...so did I. Oh well. We can pull the good ideas from both. :-)

> There are still some oddity in lifetime of the huge page reservation, but that
> can be discussed once everyone sees the code.
>
> - Ken
>
>
>

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-04-05 23:20:21

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [Lse-tech] RE: [PATCH] HUGETLB memory commitment

>>>> Ray Bryant wrote on Monday, April 05, 2004 11:22 AM
> > Chen, Kenneth W wrote:
> > I actually started coding yesterday. It doesn't look too bad (I think).
> > I will post it once I finish it up later today or tomorrow.
>
> Hmmm...so did I. Oh well. We can pull the good ideas from both. :-)

I did have a revelation from your original demand-paging patch with per-inode
tracking ;-) I extended it into tracking by struct address_space (so we don't
pollute inode structure) and added per-block tracking. See patch at the end of
this post. I admit I had very pessimistic thoughts until I saw your patch.


> > There are still some oddities in the lifetime of the huge page reservation,
> > but that can be discussed once everyone sees the code.

I was thinking the lifetime of the huge page reservation should be the life
of a mapping, i.e., only persist across mmap/munmap. That means adding a ref
count in the per-block tracking. This seriously complicates the design
because now the ref count needs to be updated in munmap and the fault handler
in addition to mmap and truncate. Not to mention that Andy Whitcroft already
pointed out we don't get notification from munmap. Plus it seriously
complicates the tracking logic and has a performance downside as well.

I guess everyone is OK with the reservation living until file truncate?



Patch enclosed, less than 140 lines of change (only x86 and ia64 for now,
should be trivial to add other arch). Tested on linux-2.6.5

arch/i386/mm/hugetlbpage.c | 5 +
arch/ia64/mm/hugetlbpage.c | 5 +
fs/hugetlbfs/inode.c | 119 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/hugetlb.h | 8 +++
4 files changed, 135 insertions(+), 2 deletions(-)

diff -Nurp linux-2.6.5/arch/i386/mm/hugetlbpage.c linux-2.6.5.htlb/arch/i386/mm/hugetlbpage.c
--- linux-2.6.5/arch/i386/mm/hugetlbpage.c 2004-04-03 19:38:15.000000000 -0800
+++ linux-2.6.5.htlb/arch/i386/mm/hugetlbpage.c 2004-04-05 16:09:29.000000000 -0700
@@ -22,7 +22,8 @@

static long htlbpagemem;
int htlbpage_max;
-static long htlbzone_pages;
+long htlbzone_pages;
+long htlbpage_resv;

static struct list_head hugepage_freelists[MAX_NUMNODES];
static spinlock_t htlbpage_lock = SPIN_LOCK_UNLOCKED;
@@ -516,9 +517,11 @@ int hugetlb_report_meminfo(char *buf)
return sprintf(buf,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
+ "HugePages_Resv: %5lu\n"
"Hugepagesize: %5lu kB\n",
htlbzone_pages,
htlbpagemem,
+ htlbpage_resv,
HPAGE_SIZE/1024);
}

diff -Nurp linux-2.6.5/arch/ia64/mm/hugetlbpage.c linux-2.6.5.htlb/arch/ia64/mm/hugetlbpage.c
--- linux-2.6.5/arch/ia64/mm/hugetlbpage.c 2004-04-03 19:37:07.000000000 -0800
+++ linux-2.6.5.htlb/arch/ia64/mm/hugetlbpage.c 2004-04-05 16:09:41.000000000 -0700
@@ -24,7 +24,8 @@

static long htlbpagemem;
int htlbpage_max;
-static long htlbzone_pages;
+long htlbzone_pages;
+long htlbpage_resv;
unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;

static struct list_head hugepage_freelists[MAX_NUMNODES];
@@ -579,9 +580,11 @@ int hugetlb_report_meminfo(char *buf)
return sprintf(buf,
"HugePages_Total: %5lu\n"
"HugePages_Free: %5lu\n"
+ "HugePages_Resv: %5lu\n"
"Hugepagesize: %5lu kB\n",
htlbzone_pages,
htlbpagemem,
+ htlbpage_resv,
HPAGE_SIZE/1024);
}

diff -Nurp linux-2.6.5/fs/hugetlbfs/inode.c linux-2.6.5.htlb/fs/hugetlbfs/inode.c
--- linux-2.6.5/fs/hugetlbfs/inode.c 2004-04-03 19:38:14.000000000 -0800
+++ linux-2.6.5.htlb/fs/hugetlbfs/inode.c 2004-04-05 16:09:41.000000000 -0700
@@ -43,6 +43,121 @@ static struct backing_dev_info hugetlbfs
.memory_backed = 1, /* Does not contribute to dirty memory */
};

+enum file_area_action {
+ INSERT,
+ FRONT_MERGE,
+ BACK_MERGE,
+ THREE_WAY_MERGE
+};
+
+/*
+ * return 0 if reservation is granted
+ */
+static int hugetlb_reserve_page(struct address_space *mapping,
+ struct vm_area_struct *vma)
+{
+ unsigned long block_start, block_end, resv;
+ struct list_head *p, *head;
+ struct file_area_struct *curr, *next;
+ enum file_area_action action;
+ int ret = -ENOMEM;
+
+ block_start = vma->vm_pgoff >> (HPAGE_SHIFT-PAGE_SHIFT);
+ block_end = block_start + ((vma->vm_end - vma->vm_start) >> HPAGE_SHIFT);
+
+ down(&mapping->i_shared_sem);
+
+ action = INSERT;
+ resv = block_end - block_start;
+ head = &mapping->private_list;
+ curr = next = NULL;
+ list_for_each(p, head) {
+ curr = list_entry(p, struct file_area_struct, list);
+ if (p->next != head)
+ next = list_entry(p->next, struct file_area_struct, list);
+
+ if (block_start <= curr->end) {
+ if (block_end <= curr->end) {
+ ret = 0;
+ goto out;
+ } else if (!next || block_end < next->start) {
+ resv = block_end - curr->end;
+ action = BACK_MERGE;
+ } else {
+ resv = next->start - curr->end;
+ action = THREE_WAY_MERGE;
+ }
+ } else if (!next || block_start < next->start) {
+ if (!next || block_end < next->start) {
+ resv = block_end - block_start;
+ action = INSERT;
+ } else {
+ curr = next;
+ resv = curr->start - block_start;
+ action = FRONT_MERGE;
+ }
+ }
+ else
+ continue;
+ }
+
+ /* check page reservation */
+ if (resv > (htlbzone_pages - htlbpage_resv))
+ goto out;
+
+ /* FIXME: check file system quota */
+
+ /* we have enough hugetlb page, go ahead reserve them */
+ switch(action) {
+ case BACK_MERGE:
+ curr->end = block_end;
+ break;
+ case FRONT_MERGE:
+ curr->start = block_start;
+ break;
+ case THREE_WAY_MERGE:
+ curr->end = next->end;
+ list_del(p->next);
+ kfree(next);
+ break;
+ case INSERT:
+ curr = kmalloc(sizeof(*curr), GFP_KERNEL);
+ if (!curr)
+ goto out;
+ curr->start = block_start;
+ curr->end = block_end;
+ list_add(&curr->list, p);
+ break;
+ }
+ htlbpage_resv += resv;
+ ret = 0;
+out:
+ up(&mapping->i_shared_sem);
+ return ret;
+}
+
+static void hugetlb_unreserve_page(struct address_space *mapping, loff_t lstart)
+{
+ struct file_area_struct *curr, *tmp;
+ unsigned long resv;
+
+ lstart >>= HPAGE_SHIFT;
+ down(&mapping->i_shared_sem);
+ list_for_each_entry_safe(curr, tmp, &mapping->private_list, list) {
+ if (lstart <= curr->start) {
+ resv = curr->end - curr->start;
+ list_del(&curr->list);
+ kfree(curr);
+ }
+ else {
+ resv = curr->end - lstart;
+ curr->end = lstart;
+ }
+ htlbpage_resv -= resv;
+ }
+ up(&mapping->i_shared_sem);
+}
+
static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_dentry->d_inode;
@@ -59,6 +174,9 @@ static int hugetlbfs_file_mmap(struct fi
if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
return -EINVAL;

+ if (hugetlb_reserve_page(mapping, vma))
+ return -ENOMEM;
+
vma_len = (loff_t)(vma->vm_end - vma->vm_start);

down(&inode->i_sem);
@@ -186,6 +304,7 @@ void truncate_hugepages(struct address_s
huge_pagevec_release(&pvec);
}
BUG_ON(!lstart && mapping->nrpages);
+ hugetlb_unreserve_page(mapping, lstart);
}

static void hugetlbfs_delete_inode(struct inode *inode)
diff -Nurp linux-2.6.5/include/linux/hugetlb.h linux-2.6.5.htlb/include/linux/hugetlb.h
--- linux-2.6.5/include/linux/hugetlb.h 2004-04-03 19:37:06.000000000 -0800
+++ linux-2.6.5.htlb/include/linux/hugetlb.h 2004-04-05 16:09:41.000000000 -0700
@@ -30,6 +30,8 @@ int is_aligned_hugepage_range(unsigned l
int pmd_huge(pmd_t pmd);

extern int htlbpage_max;
+extern long htlbzone_pages;
+extern long htlbpage_resv;

static inline void
mark_mm_hugetlb(struct mm_struct *mm, struct vm_area_struct *vma)
@@ -103,6 +105,12 @@ struct hugetlbfs_sb_info {
spinlock_t stat_lock;
};

+struct file_area_struct {
+ struct list_head list;
+ unsigned long start;
+ unsigned long end;
+};
+
static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
{
return sb->s_fs_info;


2004-04-06 01:59:56

by Ray Bryant

[permalink] [raw]
Subject: Re: [Lse-tech] RE: [PATCH] HUGETLB memory commitment

Hi Ken,

Chen, Kenneth W wrote:
>>>>>Ray Bryant wrote on Monday, April 05, 2004 11:22 AM
>>>
>>>Chen, Kenneth W wrote:
>>>I actually started coding yesterday. It doesn't look too bad (I think).
>>>I will post it once I finish it up later today or tomorrow.
>>
>>Hmmm...so did I. Oh well. We can pull the good ideas from both. :-)
>
>
> I did have a revelation from your original demand-paging patch with per-inode
> tracking ;-) I extended it into tracking by struct address_space (so we don't
> pollute inode structure) and added per-block tracking. See patch at the end of
> this post. I admit I had very pessimistic thoughts until I saw your patch.
>

Cool!

Either way works, I think. I just used the u.generic_ip pointer because it
was there and convenient. It's intended to be a hook in the VFS layer for a
particular file system to add info to the inode, as near as I can tell, but
neither hugetlbfs nor shmfs uses it, so it was free for the taking.

>
>
>>>There are still some oddities in the lifetime of the huge page reservation,
>>>but that can be discussed once everyone sees the code.
>
>
> I was thinking the lifetime of the huge page reservation should be the life
> of a mapping, i.e., only persist across mmap/munmap. That means adding a ref
> count in the per-block tracking. This seriously complicates the design
> because now the ref count needs to be updated in munmap and the fault handler
> in addition to mmap and truncate. Not to mention that Andy Whitcroft already
> pointed out we don't get notification from munmap. Plus it seriously
> complicates the tracking logic and has a performance downside as well.
>
> I guess everyone is OK with the reservation living until file truncate?

One can certainly argue that the only thing that is required to live until
file truncate is the contents of the huge pages in the page cache, since
applications expect that the data will be there in the file/segment across
program executions until the file is truncated or the segment deleted.

But it certainly makes sense to me that if program A creates an mmap()'d file
of 10 huge pages, that if Program B comes along later and re-mmaps() that same
file, that Program B will be guaranteed to be able to touch all 10 pages, even
if Program A only touched 5. So that is an argument for having the
reservation last until file truncate/segment removal time.

Additionally, recall that we are trying to emulate the behavior of the
hugetlb_prefault() implementation. Under that implementation, if Program A
would mmap() 10 huge pages, then Program B would be guarenteed not to get a
SIGBUS when it mmap()'s and references those 10 pages, provided only that the
underlying file/segment was not deleted in between execution of the two programs.

So, I think we >>have<< to have the reservation last until file
truncate/segment deletion time. Fortunately, that turns out to be easier to
implement as well. :-)

I'll check through your patch and make sure we've both covered the same bases
there. If so, we should be good to go with either version.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-04-06 16:13:26

by Andy Whitcroft

[permalink] [raw]
Subject: RE: [Lse-tech] RE: [PATCH] HUGETLB memory commitment

--On 05 April 2004 16:18 -0700 "Chen, Kenneth W" <[email protected]> wrote:

>>>>> Ray Bryant wrote on Monday, April 05, 2004 11:22 AM
>> > Chen, Kenneth W wrote:
>> > I actually started coding yesterday. It doesn't look too bad (I think).
>> > I will post it once I finish it up later today or tomorrow.
>>
>> Hmmm...so did I. Oh well. We can pull the good ideas from both. :-)

Bugger, so am I. Someone will have to merge :)

> + /* we have enough hugetlb page, go ahead reserve them */
> + switch(action) {
> + case BACK_MERGE:
> + curr->end = block_end;
> + break;
> + case FRONT_MERGE:
> + curr->start = block_start;
> + break;
> + case THREE_WAY_MERGE:
> + curr->end = next->end;
> + list_del(p->next);
> + kfree(next);
> + break;

I don't know if I have read this right, but if I have then you only support
overlapping with two existing extents? What if there are extents from 0-4,
6-8 and 10-12 when you map 0-16? Will that not corrupt the list?
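
(To put numbers on that, mine not Ken's: those three extents already cover
4 + 2 + 2 = 8 pages, so mapping 0-16 should charge only 16 - 8 = 8 new pages
and collapse the list to the single extent 0-16; a merge that only handles
two neighbours leaves stale entries behind.)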

Anyhow, below is a work in progress, i.e. it compiles and boots and passes
the tests I've applied (error handling not yet well tested). The region
accumulation code has been extensively tested in a user-level test harness,
so I am fairly sure it works. I have split the request and commit phases
for the region handling to allow simpler backout on other failures such as
quota (which remains to be fixed).

There is definitely debug and extra unused code in there ... Comments etc.
appreciated.
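
As a flavour of what such a harness looks like, here is a cut-down,
self-contained re-creation of the accumulation logic (mine, not the actual
harness), exercising the three-extent case raised above:

#include <stdio.h>
#include <stdlib.h>

struct region {
	long from, to;			/* [from, to) in huge pages */
	struct region *next;
};

static struct region *head;

/* Merge [f, t) into the list; return the pages not already covered. */
static long region_add(long f, long t)
{
	struct region **pp = &head, *r;
	long chg = t - f;

	/* Skip regions that end strictly below our start. */
	while ((r = *pp) && r->to < f)
		pp = &r->next;

	/* Absorb every region overlapping or abutting [f, t): widen
	 * our edges and subtract what was already reserved. */
	while ((r = *pp) && r->from <= t) {
		if (r->from < f) {
			chg += f - r->from;
			f = r->from;
		}
		if (r->to > t) {
			chg += r->to - t;
			t = r->to;
		}
		chg -= r->to - r->from;	/* already committed */
		*pp = r->next;
		free(r);
	}

	r = malloc(sizeof(*r));
	if (!r)
		abort();
	r->from = f;
	r->to = t;
	r->next = *pp;
	*pp = r;
	return chg;
}

int main(void)
{
	region_add(0, 4);
	region_add(6, 8);
	region_add(10, 12);
	/* The scenario above: 8 of 16 pages already reserved. */
	printf("chg=%ld region=[%ld,%ld)\n",
	       region_add(0, 16), head->from, head->to);
	return 0;
}

It prints chg=8 region=[0,16): only the uncovered pages are charged and
the three extents collapse to a single one.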

-apw

[070-hugetlb_tracking_R6]

---
fs/hugetlbfs/inode.c | 277 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/proc/proc_misc.c | 1
include/linux/hugetlb.h | 5
3 files changed, 278 insertions(+), 5 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/hugetlbfs/inode.c current/fs/hugetlbfs/inode.c
--- reference/fs/hugetlbfs/inode.c 2004-03-25 02:43:00.000000000 +0000
+++ current/fs/hugetlbfs/inode.c 2004-04-06 17:48:17.000000000 +0100
@@ -32,6 +32,234 @@
/* some random number */
#define HUGETLBFS_MAGIC 0x958458f6

+#define HUGETLBFS_NOACCT (~0UL)
+
+atomic_t hugetlb_committed_space = ATOMIC_INIT(0);
+
+int hugetlb_acct_memory(long delta)
+{
+ atomic_add(delta, &hugetlb_committed_space);
+ if (delta > 0 && atomic_read(&hugetlb_committed_space) >
+ hugetlb_total_pages()) {
+ atomic_add(-delta, &hugetlb_committed_space);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+#if 0
+int hugetlb_charge_page(struct vm_area_struct *vma)
+{
+ int ret;
+
+ /* If this file is marked for commit on demand then see if we can
+ * commit a page; if so, account for it against this file. */
+ if (vma->vm_file->f_dentry->d_inode->i_blocks != HUGETLBFS_NOACCT) {
+ ret = hugetlb_acct_memory(HPAGE_SIZE / PAGE_SIZE);
+ if (ret)
+ return ret;
+ vma->vm_file->f_dentry->d_inode->i_blocks++;
+ }
+ return 0;
+}
+int hugetlb_uncharge_page(struct vm_area_struct *vma)
+{
+ /* If this file is marked for commit on demand, return a page. */
+ if (vma->vm_file->f_dentry->d_inode->i_blocks != HUGETLBFS_NOACCT) {
+ hugetlb_acct_memory(-(HPAGE_SIZE / PAGE_SIZE));
+ vma->vm_file->f_dentry->d_inode->i_blocks--;
+ }
+ return 0;
+}
+#endif
+
+struct file_region {
+ struct list_head link;
+ loff_t from;
+ loff_t to;
+};
+
+int region_add(struct list_head *head, loff_t f, loff_t t)
+{
+ struct file_region *rg;
+ struct file_region *nrg;
+ struct file_region *trg;
+
+ printk(KERN_WARNING "region_add: head<%p> f<%lld> t<%lld>\n",
+ head, f, t);
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* region_chg should have created this region; fail if missing.
+ * (rg is never NULL after list_for_each_entry, so test the head.) */
+ if (&rg->link == head || t < rg->from) {
+ printk(KERN_WARNING "region_add: existing region missing\n");
+ return -EINVAL;
+ }
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (f > rg->from)
+ f = rg->from;
+
+ /* Check for and consume any regions we now overlap with. */
+ nrg = rg;
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > t)
+ break;
+
+ /* If this area reaches higher, then extend our area to
+ * include it completely. If this is not the first area
+ * which we intend to reuse, free it. */
+ if (rg->to > t)
+ t = rg->to;
+ printk(KERN_WARNING "region: consume %p %lld %lld\n",
+ rg, rg->from, rg->to);
+
+ if (rg != nrg) {
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ }
+ nrg->from = f;
+ nrg->to = t;
+ return 0;
+}
+
+int region_chg(struct list_head *head, loff_t f, loff_t t)
+{
+ struct file_region *rg;
+ struct file_region *nrg;
+ loff_t chg = 0;
+
+ printk(KERN_WARNING "region_chg: head<%p> f<%lld> t<%lld>\n",
+ head, f, t);
+
+ /* Locate the region we are before or in. */
+ list_for_each_entry(rg, head, link)
+ if (f <= rg->to)
+ break;
+
+ /* If we are below the current region then a new region is required.
+ * Subtly, allocate a new region at the position but make it zero
+ * size such that we are guaranteed to record the reservation. */
+ if (&rg->link == head || t < rg->from) {
+ nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+ if (!nrg)
+ return -ENOMEM;
+ nrg->from = f;
+ nrg->to = f;
+ INIT_LIST_HEAD(&nrg->link);
+ list_add(&nrg->link, rg->link.prev);
+
+ printk(KERN_WARNING "region: new %p %lld %lld\n",
+ nrg, nrg->from, nrg->to);
+ return t - f;
+ }
+
+ /* Round our left edge to the current segment if it encloses us. */
+ if (f > rg->from)
+ f = rg->from;
+ chg = t - f;
+
+ /* Check for and consume any regions we now overlap with. */
+ list_for_each_entry(rg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ if (rg->from > t)
+ return chg;
+
+ /* We overlap with this area; if it extends further than
+ * us then we must extend ourselves. Account for its
+ * existing reservation. */
+ if (rg->to > t) {
+ chg += rg->to - t;
+ t = rg->to;
+ }
+ chg -= rg->to - rg->from;
+ }
+ return chg;
+}
+
+int region_truncate(struct list_head *head, loff_t end)
+{
+ struct file_region *rg;
+ struct file_region *trg;
+ int chg = 0;
+
+ printk(KERN_WARNING "region_truncate: head<%p> end<%lld>\n",
+ head, end);
+
+ /* Locate the region we are either in or before. */
+ list_for_each_entry(rg, head, link)
+ if (end <= rg->to)
+ break;
+ if (&rg->link == head)
+ return 0;
+
+ /* If we are in the middle of a region then adjust it. */
+ if (end > rg->from) {
+ chg = rg->to - end;
+ rg->to = end;
+ rg = list_entry(rg->link.next, typeof(*rg), link);
+ }
+
+ /* Drop any remaining regions. */
+ list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+ if (&rg->link == head)
+ break;
+ chg += rg->to - rg->from;
+ list_del(&rg->link);
+ kfree(rg);
+ }
+ return chg;
+}
+
+
+int region_dump(struct list_head *head)
+{
+ struct file_region *rg;
+
+ list_for_each_entry(rg, head, link)
+ printk(KERN_WARNING "rg<%p> f<%lld> t<%lld>\n",
+ rg, rg->from, rg->to);
+ return 0;
+}
+
+int hugetlb_acct_prepare(struct inode *inode, int from, int to)
+{
+ int chg;
+ int ret;
+
+ /* Calculate the commitment change implied by this mapping. */
+ chg = region_chg(&inode->i_mapping->private_list, from, to);
+ if (chg < 0)
+ return chg;
+ printk(KERN_WARNING "hugetlb_acct_prepare: chg<%d>\n", chg);
+
+ /* Charge the pages; return the size of the charge so that the
+ * caller can pass it to hugetlb_acct_undo() on failure. */
+ ret = hugetlb_acct_memory(chg);
+ if (ret < 0)
+ return ret;
+
+ return chg;
+}
+int hugetlb_acct_commit(struct inode *inode, int from, int to)
+{
+ return region_add(&inode->i_mapping->private_list, from, to);
+}
+int hugetlb_acct_undo(struct inode *inode, int chg)
+{
+ return hugetlb_acct_memory(-chg);
+}
+int hugetlbfs_report_meminfo(char *buf)
+{
+#define K(x) ((x) << (PAGE_SHIFT - 10))
+ long htlb = atomic_read(&hugetlb_committed_space);
+ return sprintf(buf, "HugeCommitted_AS: %5lu kB\n", K(htlb));
+#undef K
+}
+
static struct super_operations hugetlbfs_ops;
static struct address_space_operations hugetlbfs_aops;
struct file_operations hugetlbfs_file_operations;
@@ -49,6 +277,7 @@ static int hugetlbfs_file_mmap(struct fi
struct address_space *mapping = inode->i_mapping;
loff_t len, vma_len;
int ret;
+ int chg;

if (vma->vm_start & ~HPAGE_MASK)
return -EINVAL;
@@ -62,13 +291,33 @@ static int hugetlbfs_file_mmap(struct fi
vma_len = (loff_t)(vma->vm_end - vma->vm_start);

down(&inode->i_sem);
+
+ /* Calculate the commitment implied by this mapping. */
+ chg = hugetlb_acct_prepare(inode, vma->vm_pgoff,
+ vma->vm_pgoff + (vma_len >> PAGE_SHIFT));
+ if (chg < 0) {
+ ret = chg;
+ goto unlock_out;
+ }
+ printk(KERN_WARNING "hugetlbfs_file_mmap: len<%d>\n", chg);
+
+ /* FIXME, check the quota here, before we commit the change. */
+
file_accessed(file);
vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
vma->vm_ops = &hugetlb_vm_ops;
ret = hugetlb_prefault(mapping, vma);
len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
- if (ret == 0 && inode->i_size < len)
- inode->i_size = len;
+ if (ret == 0) {
+ if (inode->i_size < len)
+ inode->i_size = len;
+ /* Record the commitment. */
+ hugetlb_acct_commit(inode, vma->vm_pgoff,
+ vma->vm_pgoff + (vma_len >> PAGE_SHIFT));
+ } else
+ hugetlb_acct_undo(inode, chg);
+
+unlock_out:
up(&inode->i_sem);

return ret;
@@ -191,6 +440,7 @@ void truncate_hugepages(struct address_s
static void hugetlbfs_delete_inode(struct inode *inode)
{
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(inode->i_sb);
+ int chg;

hlist_del_init(&inode->i_hash);
list_del_init(&inode->i_list);
@@ -200,6 +450,8 @@ static void hugetlbfs_delete_inode(struc

if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);
+ chg = region_truncate(&inode->i_mapping->private_list, 0);
+ hugetlb_acct_memory(-chg);

security_inode_delete(inode);

@@ -217,6 +469,7 @@ static void hugetlbfs_forget_inode(struc
{
struct super_block *super_block = inode->i_sb;
struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(super_block);
+ int chg;

if (hlist_unhashed(&inode->i_hash))
goto out_truncate;
@@ -241,6 +494,8 @@ out_truncate:
spin_unlock(&inode_lock);
if (inode->i_data.nrpages)
truncate_hugepages(&inode->i_data, 0);
+ chg = region_truncate(&inode->i_mapping->private_list, 0);
+ hugetlb_acct_memory(-chg);

if (sbinfo->free_inodes >= 0) {
spin_lock(&sbinfo->stat_lock);
@@ -311,6 +566,7 @@ static int hugetlb_vmtruncate(struct ino
{
unsigned long pgoff;
struct address_space *mapping = inode->i_mapping;
+ int chg;

if (offset > inode->i_size)
return -EINVAL;
@@ -326,6 +582,8 @@ static int hugetlb_vmtruncate(struct ino
hugetlb_vmtruncate_list(&mapping->i_mmap_shared, pgoff);
up(&mapping->i_shared_sem);
truncate_hugepages(mapping, offset);
+ chg = region_truncate(&mapping->private_list, offset);
+ hugetlb_acct_memory(-chg);
return 0;
}

@@ -350,6 +608,10 @@ static int hugetlbfs_setattr(struct dent
error = hugetlb_vmtruncate(inode, attr->ia_size);
if (error)
goto out;
+ /* We rely on the fact that the sizes are hugepage aligned,
+ * and that hugetlb_vmtruncate prevents extending the file. */
+ hugetlb_acct_memory((attr->ia_size - i_size_read(inode)) /
+ PAGE_SIZE);
attr->ia_valid &= ~ATTR_SIZE;
}
error = inode_setattr(inode, attr);
@@ -382,6 +644,7 @@ static struct inode *hugetlbfs_get_inode
inode->i_blocks = 0;
inode->i_mapping->a_ops = &hugetlbfs_aops;
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
+ INIT_LIST_HEAD(&inode->i_mapping->private_list);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
default:
@@ -710,9 +973,6 @@ struct file *hugetlb_zero_setup(size_t s
if (!capable(CAP_IPC_LOCK))
return ERR_PTR(-EPERM);

- if (!is_hugepage_mem_enough(size))
- return ERR_PTR(-ENOMEM);
-
root = hugetlbfs_vfsmount->mnt_root;
snprintf(buf, 16, "%lu", hugetlbfs_counter());
quick_string.name = buf;
@@ -736,11 +996,18 @@ struct file *hugetlb_zero_setup(size_t s
d_instantiate(dentry, inode);
inode->i_size = size;
inode->i_nlink = 0;
+ inode->i_blocks = HUGETLBFS_NOACCT;
file->f_vfsmnt = mntget(hugetlbfs_vfsmount);
file->f_dentry = dentry;
file->f_mapping = inode->i_mapping;
file->f_op = &hugetlbfs_file_operations;
file->f_mode = FMODE_WRITE | FMODE_READ;
+
+ error = hugetlb_acct_prepare(inode, 0, size / PAGE_SIZE);
+ if (error < 0)
+ goto out_file;
+ hugetlb_acct_commit(inode, 0, size / PAGE_SIZE);
+
return file;

out_file:
diff -X /home/apw/lib/vdiff.excl -rupN reference/fs/proc/proc_misc.c current/fs/proc/proc_misc.c
--- reference/fs/proc/proc_misc.c 2004-04-02 00:37:04.000000000 +0100
+++ current/fs/proc/proc_misc.c 2004-04-01 22:51:19.000000000 +0100
@@ -232,6 +232,7 @@ static int meminfo_read_proc(char *page,
);

len += hugetlb_report_meminfo(page + len);
+ len += hugetlbfs_report_meminfo(page + len);

return proc_calc_metrics(page, start, off, count, eof, len);
#undef K
diff -X /home/apw/lib/vdiff.excl -rupN reference/include/linux/hugetlb.h current/include/linux/hugetlb.h
--- reference/include/linux/hugetlb.h 2004-04-02 00:38:24.000000000 +0100
+++ current/include/linux/hugetlb.h 2004-04-01 22:51:19.000000000 +0100
@@ -115,11 +115,16 @@ static inline void set_file_hugepages(st
{
file->f_op = &hugetlbfs_file_operations;
}
+int hugetlbfs_report_meminfo(char *);
+int hugetlb_charge_page(struct vm_area_struct *vma);
+int hugetlb_uncharge_page(struct vm_area_struct *vma);
+
#else /* !CONFIG_HUGETLBFS */

#define is_file_hugepages(file) 0
#define set_file_hugepages(file) BUG()
#define hugetlb_zero_setup(size) ERR_PTR(-ENOSYS)
+#define hugetlbfs_report_meminfo(buf) 0

#endif /* !CONFIG_HUGETLBFS */


2004-04-06 17:41:23

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [Lse-tech] RE: [PATCH] HUGETLB memory commitment

>>>> Andy Whitcroft wrote on Tue, April 06, 2004 9:14 AM
> >>>>> Ray Bryant wrote on Monday, April 05, 2004 11:22 AM
> >> > Chen, Kenneth W wrote:
> >> > I actually started coding yesterday. It doesn't look too bad (I think).
> >> > I will post it once I finished it up later today or tomorrow.
> >>
> >> Hmmm...so did I. Oh well. We can pull the good ideas from both. :-)
>
> Bugger, so am I. Someone will have to merge :)
>
> > + /* we have enough hugetlb page, go ahead reserve them */
> > + switch(action) {
> > + case BACK_MERGE:
> > + curr->end = block_end;
> > + break;
> > + case FRONT_MERGE:
> > + curr->start = block_start;
> > + break;
> > + case THREE_WAY_MERGE:
> > + curr->end = next->end;
> > + list_del(p->next);
> > + kfree(next);
> > + break;
>
> I don't know if I have read this right, but if I have then you only support
> overlapping with two existing extents? What if there are extents from 0-4,
> 6-8 and 10-12 when you map 0-16? Will that not corrupt the list?


Doh, you are absolutely right that this code is broken in that scenario.



> Anyhow, below is a work in progress, ie it compiles and boots and passes
> the tests I've applied (not tested error handling well yet). The regions
> accumulation code has been extensively tested in a user level test harness,
> so I am fairly sure it works. I have split the request and commit phases
> for the region handling to allow simpler backout on other failure such as
> quota (which remains to be fixed).
>
>
> @@ -736,11 +996,18 @@ struct file *hugetlb_zero_setup(size_t s
> d_instantiate(dentry, inode);
> inode->i_size = size;
> inode->i_nlink = 0;
> + inode->i_blocks = HUGETLBFS_NOACCT;
> file->f_vfsmnt = mntget(hugetlbfs_vfsmount);
> file->f_dentry = dentry;
> file->f_mapping = inode->i_mapping;
> file->f_op = &hugetlbfs_file_operations;
> file->f_mode = FMODE_WRITE | FMODE_READ;
> +
> + error = hugetlb_acct_prepare(inode, 0, size / PAGE_SIZE);
> + if (error < 0)
> + goto out_file;
> + hugetlb_acct_commit(inode, 0, size / PAGE_SIZE);
> +
> return file;

Is this absolutely necessary? There is no vma associated at shmget().
shmat() eventually calls mmap, and that is already covered.

- Ken