2008-03-16 17:31:24

by Balbir Singh

[permalink] [raw]
Subject: [RFC][0/3] Virtual address space control for cgroups

This is an early patchset for virtual address space control for cgroups.
The patches are against 2.6.25-rc5-mm1 and have been tested on top of
User Mode Linux.

The first patch adds the user interface. The second patch adds accounting
and control. The third patch updates documentation.

Review suggestions would be appreciated along the lines of

1. What is missing? Are all virtual address space expansion points covered?
2. Do we need to account and control address space at insert_special_mapping?
3. Address space accounting may contain duplications. Do we need to avoid it?

Comments?

series

memory-controller-virtual-address-space-control-user-interface
memory-controller-virtual-address-space-accounting-and-control
memory-controller-virtual-address-control-documentation



--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


2008-03-16 17:31:38

by Balbir Singh

[permalink] [raw]
Subject: [RFC][1/3] Add user interface for virtual address space control



Add as_usage_in_bytes and as_limit_in_bytes interfaces. These provide
control over the total address space that the processes combined together
in the cgroup can grow upto. This functionality is analogous to
the RLIMIT_AS function of the getrlimit(2) and setrlimit(2) calls.
A as_res resource counter is added to the mem_cgroup structure. The
as_res counter handles all the accounting associated with the virtual
address space accounting and control of cgroups.

Signed-off-by: Balbir Singh <[email protected]>
---

mm/memcontrol.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-control-user-interface mm/memcontrol.c
--- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-control-user-interface 2008-03-16 22:57:38.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/memcontrol.c 2008-03-16 22:57:38.000000000 +0530
@@ -128,6 +128,10 @@ struct mem_cgroup {
*/
struct res_counter res;
/*
+ * Address space limits
+ */
+ struct res_counter as_res;
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -870,6 +874,21 @@ static ssize_t mem_cgroup_write(struct c
mem_cgroup_write_strategy);
}

+static u64 mem_cgroup_as_read(struct cgroup *cont, struct cftype *cft)
+{
+ return res_counter_read_u64(&mem_cgroup_from_cont(cont)->as_res,
+ cft->private);
+}
+
+static ssize_t mem_cgroup_as_write(struct cgroup *cont, struct cftype *cft,
+ struct file *file, const char __user *userbuf,
+ size_t nbytes, loff_t *ppos)
+{
+ return res_counter_write(&mem_cgroup_from_cont(cont)->as_res,
+ cft->private, userbuf, nbytes, ppos,
+ mem_cgroup_write_strategy);
+}
+
static ssize_t mem_force_empty_write(struct cgroup *cont,
struct cftype *cft, struct file *file,
const char __user *userbuf,
@@ -931,6 +950,17 @@ static struct cftype mem_cgroup_files[]
.read_u64 = mem_cgroup_read,
},
{
+ .name = "as_usage_in_bytes",
+ .private = RES_USAGE,
+ .read_u64 = mem_cgroup_as_read,
+ },
+ {
+ .name = "as_limit_in_bytes",
+ .private = RES_LIMIT,
+ .write = mem_cgroup_as_write,
+ .read_u64 = mem_cgroup_as_read,
+ },
+ {
.name = "failcnt",
.private = RES_FAILCNT,
.read_u64 = mem_cgroup_read,
@@ -999,6 +1029,7 @@ mem_cgroup_create(struct cgroup_subsys *
return ERR_PTR(-ENOMEM);

res_counter_init(&mem->res);
+ res_counter_init(&mem->as_res);

memset(&mem->info, 0, sizeof(mem->info));

diff -puN include/linux/memcontrol.h~memory-controller-virtual-address-space-control-user-interface include/linux/memcontrol.h
_

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-16 17:31:53

by Balbir Singh

[permalink] [raw]
Subject: [RFC][2/3] Account and control virtual address space allocations



This patch implements accounting and control of virtual address space.
Accounting is done when the virtual address space of any task/mm_struct
belonging to the cgroup is incremented or decremented. This patch
fails the expansion if the cgroup goes over its limit. A new function
mem_cgroup_update_as() is added to deal with the accounting of the virtual
address space usage of cgroups.

TODOs

1. IA64 has code in perfmon.c pfm_smpl_buffer_alloc(), which increments
the total_vm of the mm_struct. This code has not yet been brought into
virtual address space control
2. Only when CONFIG_MMU is enabled, is the virtual address space control
enabled. Should we do this for nommu cases as well? My suspicion is
that we don't have to.

Signed-off-by: Balbir Singh <[email protected]>
---

arch/x86/kernel/ptrace.c | 10 +++++++++-
include/linux/memcontrol.h | 7 +++++++
init/Kconfig | 4 +++-
kernel/fork.c | 9 +++++++--
mm/memcontrol.c | 37 +++++++++++++++++++++++++++++++++++++
mm/memory.c | 5 +++++
mm/mmap.c | 22 ++++++++++++++++++++--
mm/mremap.c | 21 ++++++++++++++++++---
8 files changed, 106 insertions(+), 9 deletions(-)

diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control mm/memcontrol.c
--- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/memcontrol.c 2008-03-16 22:57:40.000000000 +0530
@@ -525,6 +525,32 @@ unsigned long mem_cgroup_isolate_pages(u
}

/*
+ * Check if the current cgroup exceeds its address space limit.
+ * Returns 0 on success and 1 on failure.
+ */
+int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
+{
+ int ret = 0;
+ struct mem_cgroup *mem;
+ if (mem_cgroup_subsys.disabled)
+ return ret;
+
+ rcu_read_lock();
+ mem = rcu_dereference(mm->mem_cgroup);
+ css_get(&mem->css);
+ rcu_read_unlock();
+
+ if (nr_pages > 0) {
+ if (res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE)))
+ ret = 1;
+ } else
+ res_counter_uncharge(&mem->as_res, (-nr_pages * PAGE_SIZE));
+
+ css_put(&mem->css);
+ return ret;
+}
+
+/*
* Charge the memory controller for page usage.
* Return
* 0 if the charge was successful
@@ -1103,6 +1129,17 @@ static void mem_cgroup_move_task(struct
goto out;

css_get(&mem->css);
+ /*
+ * For address space accounting, the charges are migrated.
+ * We need to migrate it since all the future uncharge/charge will
+ * now happen to the new cgroup. For consistency, we need to migrate
+ * all charges, otherwise we could end up dropping charges from
+ * the new cgroup (even though they were incurred in the current
+ * group).
+ */
+ if (res_counter_charge(&mem->as_res, mm->total_vm))
+ goto out;
+ res_counter_uncharge(&old_mem->as_res, mm->total_vm);
rcu_assign_pointer(mm->mem_cgroup, mem);
css_put(&old_mem->css);

diff -puN include/linux/memcontrol.h~memory-controller-virtual-address-space-accounting-and-control include/linux/memcontrol.h
--- linux-2.6.25-rc5/include/linux/memcontrol.h~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
+++ linux-2.6.25-rc5-balbir/include/linux/memcontrol.h 2008-03-16 22:57:40.000000000 +0530
@@ -54,6 +54,7 @@ int task_in_mem_cgroup(struct task_struc
extern int mem_cgroup_prepare_migration(struct page *page);
extern void mem_cgroup_end_migration(struct page *page);
extern void mem_cgroup_page_migration(struct page *page, struct page *newpage);
+extern int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages);

/*
* For memory reclaim.
@@ -172,6 +173,12 @@ static inline long mem_cgroup_calc_recla
{
return 0;
}
+
+static inline int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
+{
+ return 0;
+}
+
#endif /* CONFIG_CGROUP_MEM_CONT */

#endif /* _LINUX_MEMCONTROL_H */
diff -puN mm/mmap.c~memory-controller-virtual-address-space-accounting-and-control mm/mmap.c
--- linux-2.6.25-rc5/mm/mmap.c~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/mmap.c 2008-03-16 22:57:40.000000000 +0530
@@ -1117,6 +1117,9 @@ munmap_back:
}
}

+ if (mem_cgroup_update_as(mm, len >> PAGE_SHIFT))
+ return -ENOMEM;
+
/*
* Can we just expand an old private anonymous mapping?
* The VM_SHARED test is necessary because shmem_zero_setup
@@ -1226,8 +1229,11 @@ unmap_and_free_vma:
free_vma:
kmem_cache_free(vm_area_cachep, vma);
unacct_error:
- if (charged)
+ if (charged) {
+ mem_cgroup_update_as(mm, -charged);
vm_unacct_memory(charged);
+ }
+unacct_as_error:
return error;
}

@@ -1555,6 +1561,9 @@ static int acct_stack_growth(struct vm_a
if (security_vm_enough_memory(grow))
return -ENOMEM;

+ if (mem_cgroup_update_as(mm, grow))
+ return -ENOMEM;
+
/* Ok, everything looks good - let it rip */
mm->total_vm += grow;
if (vma->vm_flags & VM_LOCKED)
@@ -2003,9 +2012,14 @@ unsigned long do_brk(unsigned long addr,
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;

- if (security_vm_enough_memory(len >> PAGE_SHIFT))
+ if (mem_cgroup_update_as(mm, (len >> PAGE_SHIFT)))
return -ENOMEM;

+ if (security_vm_enough_memory(len >> PAGE_SHIFT)) {
+ mem_cgroup_update_as(mm, -(len >> PAGE_SHIFT));
+ return -ENOMEM;
+ }
+
/* Can we just expand an old private anonymous mapping? */
if (vma_merge(mm, prev, addr, addr + len, flags,
NULL, NULL, pgoff, NULL))
@@ -2236,6 +2250,9 @@ int install_special_mapping(struct mm_st
if (unlikely(vma == NULL))
return -ENOMEM;

+ if (mem_cgroup_update_as(mm, len >> PAGE_SHIFT))
+ return -ENOMEM;
+
vma->vm_mm = mm;
vma->vm_start = addr;
vma->vm_end = addr + len;
@@ -2248,6 +2265,7 @@ int install_special_mapping(struct mm_st

if (unlikely(insert_vm_struct(mm, vma))) {
kmem_cache_free(vm_area_cachep, vma);
+ mem_cgroup_update_as(mm, -(len >> PAGE_SHIFT));
return -ENOMEM;
}

diff -puN arch/x86/kernel/ptrace.c~memory-controller-virtual-address-space-accounting-and-control arch/x86/kernel/ptrace.c
--- linux-2.6.25-rc5/arch/x86/kernel/ptrace.c~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
+++ linux-2.6.25-rc5-balbir/arch/x86/kernel/ptrace.c 2008-03-16 22:57:40.000000000 +0530
@@ -20,6 +20,7 @@
#include <linux/audit.h>
#include <linux/seccomp.h>
#include <linux/signal.h>
+#include <linux/memcontrol.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -787,6 +788,8 @@ static int ptrace_bts_realloc(struct tas
current->mm->total_vm -= old_size;
current->mm->locked_vm -= old_size;

+ mem_cgroup_update_as(current->mm, -old_size);
+
if (size == 0)
goto out;

@@ -816,10 +819,15 @@ static int ptrace_bts_realloc(struct tas
goto out;
}

+ if (mem_cgroup_update_as(current->mm, size))
+ goto out;
+
ret = ds_allocate((void **)&child->thread.ds_area_msr,
size << PAGE_SHIFT);
- if (ret < 0)
+ if (ret < 0) {
+ mem_cgroup_update_as(current->mm, -size);
goto out;
+ }

current->mm->total_vm += size;
current->mm->locked_vm += size;
diff -puN kernel/fork.c~memory-controller-virtual-address-space-accounting-and-control kernel/fork.c
--- linux-2.6.25-rc5/kernel/fork.c~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
+++ linux-2.6.25-rc5-balbir/kernel/fork.c 2008-03-16 22:57:40.000000000 +0530
@@ -53,6 +53,7 @@
#include <linux/tty.h>
#include <linux/proc_fs.h>
#include <linux/blkdev.h>
+#include <linux/memcontrol.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -237,6 +238,7 @@ static int dup_mmap(struct mm_struct *mm

for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
struct file *file;
+ unsigned int len = vma_pages(mpnt);

if (mpnt->vm_flags & VM_DONTCOPY) {
long pages = vma_pages(mpnt);
@@ -247,11 +249,12 @@ static int dup_mmap(struct mm_struct *mm
}
charge = 0;
if (mpnt->vm_flags & VM_ACCOUNT) {
- unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
if (security_vm_enough_memory(len))
goto fail_nomem;
charge = len;
}
+ if (mem_cgroup_update_as(mm, len))
+ goto fail_nomem_as;
tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (!tmp)
goto fail_nomem;
@@ -311,8 +314,10 @@ out:
fail_nomem_policy:
kmem_cache_free(vm_area_cachep, tmp);
fail_nomem:
- retval = -ENOMEM;
+ mem_cgroup_update_as(mm, -charge);
vm_unacct_memory(charge);
+fail_nomem_as:
+ retval = -ENOMEM;
goto out;
}

diff -puN mm/mremap.c~memory-controller-virtual-address-space-accounting-and-control mm/mremap.c
--- linux-2.6.25-rc5/mm/mremap.c~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/mremap.c 2008-03-16 22:57:40.000000000 +0530
@@ -174,10 +174,15 @@ static unsigned long move_vma(struct vm_
if (mm->map_count >= sysctl_max_map_count - 3)
return -ENOMEM;

+ if (mem_cgroup_update_as(mm, new_len >> PAGE_SHIFT))
+ return -ENOMEM;
+
new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff);
- if (!new_vma)
+ if (!new_vma) {
+ mem_cgroup_update_as(mm, -(new_len >> PAGE_SHIFT));
return -ENOMEM;
+ }

moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len);
if (moved_len < old_len) {
@@ -187,6 +192,7 @@ static unsigned long move_vma(struct vm_
* and then proceed to unmap new area instead of old.
*/
move_page_tables(new_vma, new_addr, vma, old_addr, moved_len);
+ mem_cgroup_update_as(mm, -(new_len >> PAGE_SHIFT));
vma = new_vma;
old_len = new_len;
old_addr = new_addr;
@@ -347,10 +353,17 @@ unsigned long do_mremap(unsigned long ad
goto out;
}

+ if (mem_cgroup_update_as(mm, (new_len - old_len) >> PAGE_SHIFT)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
if (vma->vm_flags & VM_ACCOUNT) {
charged = (new_len - old_len) >> PAGE_SHIFT;
- if (security_vm_enough_memory(charged))
+ if (security_vm_enough_memory(charged)) {
+ mem_cgroup_update_as(mm, -charged);
goto out_nc;
+ }
}

/* old_len exactly to the end of the area..
@@ -406,8 +419,10 @@ unsigned long do_mremap(unsigned long ad
ret = move_vma(vma, addr, old_len, new_len, new_addr);
}
out:
- if (ret & ~PAGE_MASK)
+ if (ret & ~PAGE_MASK) {
vm_unacct_memory(charged);
+ mem_cgroup_update_as(mm, -charged);
+ }
out_nc:
return ret;
}
diff -puN init/Kconfig~memory-controller-virtual-address-space-accounting-and-control init/Kconfig
--- linux-2.6.25-rc5/init/Kconfig~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
+++ linux-2.6.25-rc5-balbir/init/Kconfig 2008-03-16 22:57:40.000000000 +0530
@@ -369,7 +369,9 @@ config CGROUP_MEM_RES_CTLR
depends on CGROUPS && RESOURCE_COUNTERS
help
Provides a memory resource controller that manages both page cache and
- RSS memory.
+ RSS memory. It also provide accounting and control of address
+ space allocations (along the lines of RLIMIT_AS) for cgroups
+ when CONFIG_MMU is enabled.

Note that setting this option increases fixed memory overhead
associated with each page of memory in the system by 4/8 bytes
diff -puN mm/swapfile.c~memory-controller-virtual-address-space-accounting-and-control mm/swapfile.c
diff -puN mm/memory.c~memory-controller-virtual-address-space-accounting-and-control mm/memory.c
--- linux-2.6.25-rc5/mm/memory.c~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/memory.c 2008-03-16 22:57:40.000000000 +0530
@@ -838,6 +838,11 @@ unsigned long unmap_vmas(struct mmu_gath

if (vma->vm_flags & VM_ACCOUNT)
*nr_accounted += (end - start) >> PAGE_SHIFT;
+ /*
+ * Unaccount used virtual memory for cgroups
+ */
+ mem_cgroup_update_as(vma->vm_mm,
+ ((long)(start - end)) >> PAGE_SHIFT);

while (start != end) {
if (!tlb_start_valid) {
_

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-16 17:32:12

by Balbir Singh

[permalink] [raw]
Subject: [RFC][3/3] Update documentation for virtual address space control



This patch adds documentation for virtual address space control.

Signed-off-by: Balbir Singh <[email protected]>
---

Documentation/controllers/memory.txt | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff -puN Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation Documentation/controllers/memory.txt
--- linux-2.6.25-rc5/Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation 2008-03-16 22:57:44.000000000 +0530
+++ linux-2.6.25-rc5-balbir/Documentation/controllers/memory.txt 2008-03-16 22:57:44.000000000 +0530
@@ -237,7 +237,31 @@ cgroup might have some charge associated
tasks have migrated away from it. Such charges are automatically dropped at
rmdir() if there are no tasks.

-5. TODO
+5. Virtual address space accounting
+
+A new resource counter controls the address space expansion of the tasks in
+the cgroup. Address space control is provided along the same lines as
+RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2).
+The interface for controlling address space is provided through
+"as_limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t the user
+interface. Please see section 3 for more details on how to use the user
+interface to get and set values.
+
+The "as_usage_in_bytes" file provides information about the total address
+space usage of the cgroup in bytes.
+
+5.1 Advantages of providing this feature
+
+1. Control over virtual address space allows for a cgroup to fail gracefully
+ i.e, via a malloc or mmap failure as compared to OOM kill when no
+ pages can be reclaimed
+2. It provides better control over how many pages can be swapped out when
+ the cgroup goes over it's limit. A badly setup cgroup can cause excessive
+ swapping. Providing control over the address space allocations ensures
+ that the system administrator has control over the total swapping that
+ can take place.
+
+6. TODO

1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
_

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-16 18:34:20

by Randy Dunlap

[permalink] [raw]
Subject: Re: [RFC][3/3] Update documentation for virtual address space control

On Sun, 16 Mar 2008 23:00:17 +0530 Balbir Singh wrote:

> This patch adds documentation for virtual address space control.
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
>
> Documentation/controllers/memory.txt | 26 +++++++++++++++++++++++++-
> 1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff -puN Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation Documentation/controllers/memory.txt
> --- linux-2.6.25-rc5/Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation 2008-03-16 22:57:44.000000000 +0530
> +++ linux-2.6.25-rc5-balbir/Documentation/controllers/memory.txt 2008-03-16 22:57:44.000000000 +0530
> @@ -237,7 +237,31 @@ cgroup might have some charge associated
> tasks have migrated away from it. Such charges are automatically dropped at
> rmdir() if there are no tasks.
>
> -5. TODO
> +5. Virtual address space accounting
> +
> +A new resource counter controls the address space expansion of the tasks in
> +the cgroup. Address space control is provided along the same lines as
> +RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2).
> +The interface for controlling address space is provided through
> +"as_limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t the user

w.r.t.
or even spelled out.

> +interface. Please see section 3 for more details on how to use the user
> +interface to get and set values.
> +
> +The "as_usage_in_bytes" file provides information about the total address
> +space usage of the cgroup in bytes.
> +
> +5.1 Advantages of providing this feature
> +
> +1. Control over virtual address space allows for a cgroup to fail gracefully
> + i.e, via a malloc or mmap failure as compared to OOM kill when no

i.e.,

> + pages can be reclaimed

end with period.

> +2. It provides better control over how many pages can be swapped out when
> + the cgroup goes over it's limit. A badly setup cgroup can cause excessive

its (not "it is")

> + swapping. Providing control over the address space allocations ensures
> + that the system administrator has control over the total swapping that
> + can take place.
> +
> +6. TODO
>
> 1. Add support for accounting huge pages (as a separate controller)
> 2. Make per-cgroup scanner reclaim not-shared pages first
> _

---
~Randy

2008-03-16 23:27:49

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][0/3] Virtual address space control for cgroups

On Mon, Mar 17, 2008 at 1:29 AM, Balbir Singh <[email protected]> wrote:
> This is an early patchset for virtual address space control for cgroups.
> The patches are against 2.6.25-rc5-mm1 and have been tested on top of
> User Mode Linux.

What's the performance hit of doing these accounting checks on every
mmap/munmap? If it's not totally lost in the noise, couldn't it be
made a separate control group, so that it could be just enabled (and
the performance hit taken) for users that actually want it?

Paul

2008-03-17 01:35:19

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][3/3] Update documentation for virtual address space control

Randy Dunlap wrote:
> On Sun, 16 Mar 2008 23:00:17 +0530 Balbir Singh wrote:
>
>> This patch adds documentation for virtual address space control.
>>
>> Signed-off-by: Balbir Singh <[email protected]>
>> ---
>>
>> Documentation/controllers/memory.txt | 26 +++++++++++++++++++++++++-
>> 1 file changed, 25 insertions(+), 1 deletion(-)
>>
>> diff -puN Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation Documentation/controllers/memory.txt
>> --- linux-2.6.25-rc5/Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation 2008-03-16 22:57:44.000000000 +0530
>> +++ linux-2.6.25-rc5-balbir/Documentation/controllers/memory.txt 2008-03-16 22:57:44.000000000 +0530
>> @@ -237,7 +237,31 @@ cgroup might have some charge associated
>> tasks have migrated away from it. Such charges are automatically dropped at
>> rmdir() if there are no tasks.
>>
>> -5. TODO
>> +5. Virtual address space accounting
>> +
>> +A new resource counter controls the address space expansion of the tasks in
>> +the cgroup. Address space control is provided along the same lines as
>> +RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2).
>> +The interface for controlling address space is provided through
>> +"as_limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t the user
>
> w.r.t.
> or even spelled out.
>

Will spell out.

>> +interface. Please see section 3 for more details on how to use the user
>> +interface to get and set values.
>> +
>> +The "as_usage_in_bytes" file provides information about the total address
>> +space usage of the cgroup in bytes.
>> +
>> +5.1 Advantages of providing this feature
>> +
>> +1. Control over virtual address space allows for a cgroup to fail gracefully
>> + i.e, via a malloc or mmap failure as compared to OOM kill when no
>
> i.e.,
>
>> + pages can be reclaimed
>
> end with period.

Will fix

>
>> +2. It provides better control over how many pages can be swapped out when
>> + the cgroup goes over it's limit. A badly setup cgroup can cause excessive
>
> its (not "it is")
>

Will fix :)

>> + swapping. Providing control over the address space allocations ensures
>> + that the system administrator has control over the total swapping that
>> + can take place.
>> +
>> +6. TODO
>>
>> 1. Add support for accounting huge pages (as a separate controller)
>> 2. Make per-cgroup scanner reclaim not-shared pages first
>> _
>
> ---
> ~Randy
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-17 01:47:57

by Li Zefan

[permalink] [raw]
Subject: Re: [RFC][0/3] Virtual address space control for cgroups

Paul Menage wrote:
> On Mon, Mar 17, 2008 at 1:29 AM, Balbir Singh <[email protected]> wrote:
>> This is an early patchset for virtual address space control for cgroups.
>> The patches are against 2.6.25-rc5-mm1 and have been tested on top of
>> User Mode Linux.
>
> What's the performance hit of doing these accounting checks on every
> mmap/munmap? If it's not totally lost in the noise, couldn't it be
> made a separate control group, so that it could be just enabled (and
> the performance hit taken) for users that actually want it?
>

It will be code duplication to make it a new subsystem, and it will be useful
to control both of them, am I right? :)

So could we just add a CONFIG to this patch series, like:
CONFIG_CGROUP_MEM_RES_AS_CTLR
?

> Paul
>
>

2008-03-17 01:52:07

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][0/3] Virtual address space control for cgroups

Paul Menage wrote:
> On Mon, Mar 17, 2008 at 1:29 AM, Balbir Singh <[email protected]> wrote:
>> This is an early patchset for virtual address space control for cgroups.
>> The patches are against 2.6.25-rc5-mm1 and have been tested on top of
>> User Mode Linux.
>
> What's the performance hit of doing these accounting checks on every
> mmap/munmap? If it's not totally lost in the noise, couldn't it be
> made a separate control group, so that it could be just enabled (and
> the performance hit taken) for users that actually want it?
>

I am yet to measure the performance overhead of the accounting checks. I'll try
and get started on that today. I did not consider making it a separate system,
because I suspect that anybody wanting memory control would also want address
space control (for the advantages listed in the documentation). I am not against
the idea of making it a separate subsystem, but first let me get back with the
numbers.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-17 01:56:20

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][0/3] Virtual address space control for cgroups

On Mon, Mar 17, 2008 at 9:50 AM, Balbir Singh <[email protected]> wrote:
>
> I am yet to measure the performance overhead of the accounting checks. I'll try
> and get started on that today. I did not consider making it a separate system,
> because I suspect that anybody wanting memory control would also want address
> space control (for the advantages listed in the documentation).

I'm a counter-example to your suspicion :-)

Trying to control virtual address space is a complete nightmare in the
presence of anything that uses large sparsely-populated mappings
(mmaps of large files, or large sparse heaps such as the JVM uses.)

If we want to control the effect of swapping, the right way to do it
is to control disk I/O, and ensure that the swapping is accounted to
that. Or simply just not give apps much swap space.

Paul

2008-03-17 01:57:58

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][0/3] Virtual address space control for cgroups

On Mon, Mar 17, 2008 at 9:47 AM, Li Zefan <[email protected]> wrote:
>
> It will be code duplication to make it a new subsystem,

Would it? Other than the basic cgroup boilerplate, the only real
duplication that I could see would be that there'd need to be an
additional per-mm pointer back to the cgroup. (Which could be avoided
if we added a single per-mm pointer back to the "owning" task, which
would generally be the mm's thread group leader, so that you could go
quickly from an mm to a set of cgroup subsystems).

And the advantage would that you'd be able to more easily pick/choose
which bits of control you use (and pay for).

Paul

2008-03-17 02:04:44

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

On Mon, Mar 17, 2008 at 1:30 AM, Balbir Singh <[email protected]> wrote:
> /*
> + * Check if the current cgroup exceeds its address space limit.
> + * Returns 0 on success and 1 on failure.
> + */
> +int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
> +{
> + int ret = 0;
> + struct mem_cgroup *mem;
> + if (mem_cgroup_subsys.disabled)
> + return ret;
> +
> + rcu_read_lock();
> + mem = rcu_dereference(mm->mem_cgroup);
> + css_get(&mem->css);
> + rcu_read_unlock();
> +

How about if this function avoided charging the root cgroup? You'd
save 4 atomic operations on a global data structure on every
mmap/munmap when the virtual address limit cgroup wasn't in use, which
could be significant on a large system. And I don't see situations
where you really need to limit the address space of the root cgroup.

Paul

2008-03-17 02:58:57

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

Paul Menage wrote:
> On Mon, Mar 17, 2008 at 1:30 AM, Balbir Singh <[email protected]> wrote:
>> /*
>> + * Check if the current cgroup exceeds its address space limit.
>> + * Returns 0 on success and 1 on failure.
>> + */
>> +int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
>> +{
>> + int ret = 0;
>> + struct mem_cgroup *mem;
>> + if (mem_cgroup_subsys.disabled)
>> + return ret;
>> +
>> + rcu_read_lock();
>> + mem = rcu_dereference(mm->mem_cgroup);
>> + css_get(&mem->css);
>> + rcu_read_unlock();
>> +
>
> How about if this function avoided charging the root cgroup? You'd
> save 4 atomic operations on a global data structure on every
> mmap/munmap when the virtual address limit cgroup wasn't in use, which
> could be significant on a large system. And I don't see situations
> where you really need to limit the address space of the root cgroup.

4 atomic operations is very tempting, but we want to account for root usage due
to the following reasons:

1. We want to be able to support hierarchial accounting and control
2. We want to track usage of the root cgroup and report it back to the user
3. We don't want to treat the root cgroup as a special case.



--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-17 03:04:06

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

On Mon, Mar 17, 2008 at 10:57 AM, Balbir Singh
<[email protected]> wrote:
>
> 1. We want to be able to support hierarchial accounting and control

> 2. We want to track usage of the root cgroup and report it back to the user

What use cases do you have for that?

> 3. We don't want to treat the root cgroup as a special case.

Why? It is a special case, in that in a lot of machines there's only
going to be the root cgroup, and the subsystem won't be mounted. So in
those cases, paying any overhead is a cost without a benefit.

Alternatively, how about you skip tracking virtual address space
changes if the virtual address cgroup isn't mounted on any hierarchy?
When you mount it, you can do a pass across all mms and set the root
cgroup usage to their total.

Paul

2008-03-17 03:13:49

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][0/3] Virtual address space control for cgroups

Paul Menage wrote:
> On Mon, Mar 17, 2008 at 9:50 AM, Balbir Singh <[email protected]> wrote:
>> I am yet to measure the performance overhead of the accounting checks. I'll try
>> and get started on that today. I did not consider making it a separate system,
>> because I suspect that anybody wanting memory control would also want address
>> space control (for the advantages listed in the documentation).
>
> I'm a counter-example to your suspicion :-)
>
> Trying to control virtual address space is a complete nightmare in the
> presence of anything that uses large sparsely-populated mappings
> (mmaps of large files, or large sparse heaps such as the JVM uses.)
>

Not really. Virtual limits are more gentle than an OOM kill that can occur if
the cgroup runs out of memory. Please also see
http://linux-vserver.org/Memory_Limits

> If we want to control the effect of swapping, the right way to do it
> is to control disk I/O, and ensure that the swapping is accounted to
> that. Or simply just not give apps much swap space.

Yes, a disk I/O and swap I/O controller are being developed (not by us, but
others in the community). How does one restrict swap space for a particular
application? I can think of RLIMIT_AS for a process and something similar to
what I've posted for cgroups. Not enabling swap is an option, but not very
practical IMHO.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-17 05:10:48

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][0/3] Virtual address space control for cgroups

Paul Menage wrote:
> On Mon, Mar 17, 2008 at 9:47 AM, Li Zefan <[email protected]> wrote:
>> It will be code duplication to make it a new subsystem,
>
> Would it? Other than the basic cgroup boilerplate, the only real
> duplication that I could see would be that there'd need to be an
> additional per-mm pointer back to the cgroup. (Which could be avoided
> if we added a single per-mm pointer back to the "owning" task, which
> would generally be the mm's thread group leader, so that you could go
> quickly from an mm to a set of cgroup subsystems).
>

I understand the per-mm pointer overhead back to the cgroup. I don't understand
the part about adding a per-mm pointer back to the "owning" task. We already
have task->mm. BTW, the reason by we directly add the mm_struct to mem_cgroup
mapping is that there are contexts from where only the mm_struct is known (when
we charge/uncharge). Assuming that current->mm's mem_cgorup is the one we want
to charge/uncharge is incorrect.

> And the advantage would that you'd be able to more easily pick/choose
> which bits of control you use (and pay for).

I am not sure I understand your proposal fully. But, if it can help provide the
flexibility you are referring to, I am all ears.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-17 05:23:09

by Paul Menage

[permalink] [raw]
Subject: Re: [RFC][0/3] Virtual address space control for cgroups

On Mon, Mar 17, 2008 at 1:08 PM, Balbir Singh <[email protected]> wrote:
>
> I understand the per-mm pointer overhead back to the cgroup. I don't understand
> the part about adding a per-mm pointer back to the "owning" task. We already
> have task->mm.

Yes, but we don't have mm->owner, which is what I was proposing -
mm->owner would be a pointer typically to the mm's thread group
leader. It would remove the need to have to have pointers for the
various different cgroup subsystems that need to act on an mm rather
than a task_struct, since then you could use
mm->owner->cgroups[subsys_id].

But this is kind of orthogonal to whether virtual address space limits
should be a separate cgroup subsystem.

Paul

2008-03-17 11:54:49

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

[snip]

> +int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
> +{
> + int ret = 0;
> + struct mem_cgroup *mem;
> + if (mem_cgroup_subsys.disabled)
> + return ret;
> +
> + rcu_read_lock();
> + mem = rcu_dereference(mm->mem_cgroup);
> + css_get(&mem->css);
> + rcu_read_unlock();
> +
> + if (nr_pages > 0) {
> + if (res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE)))
> + ret = 1;
> + } else
> + res_counter_uncharge(&mem->as_res, (-nr_pages * PAGE_SIZE));

No, please, no. Let's make two calls - mem_cgroup_charge_as and mem_cgroup_uncharge_as.

[snip]

> @@ -1117,6 +1117,9 @@ munmap_back:
> }
> }
>
> + if (mem_cgroup_update_as(mm, len >> PAGE_SHIFT))
> + return -ENOMEM;
> +

Why not use existintg cap_vm_enough_memory and co?

[snip]

2008-03-17 12:30:55

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

Pavel Emelyanov wrote:
> [snip]
>
>> +int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
>> +{
>> + int ret = 0;
>> + struct mem_cgroup *mem;
>> + if (mem_cgroup_subsys.disabled)
>> + return ret;
>> +
>> + rcu_read_lock();
>> + mem = rcu_dereference(mm->mem_cgroup);
>> + css_get(&mem->css);
>> + rcu_read_unlock();
>> +
>> + if (nr_pages > 0) {
>> + if (res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE)))
>> + ret = 1;
>> + } else
>> + res_counter_uncharge(&mem->as_res, (-nr_pages * PAGE_SIZE));
>
> No, please, no. Let's make two calls - mem_cgroup_charge_as and mem_cgroup_uncharge_as.
>
> [snip]
>

Yes, sure :)

>> @@ -1117,6 +1117,9 @@ munmap_back:
>> }
>> }
>>
>> + if (mem_cgroup_update_as(mm, len >> PAGE_SHIFT))
>> + return -ENOMEM;
>> +
>
> Why not use existintg cap_vm_enough_memory and co?
>

I thought about it and almost used may_expand_vm(), but there is a slight catch
there. With cap_vm_enough_memory() or security_vm_enough_memory(), they are
called after total_vm has been calculated. In our case we need to keep the
cgroups equivalent of total_vm up to date, and we do this in mem_cgorup_update_as.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-17 12:41:56

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

Balbir Singh wrote:
> Pavel Emelyanov wrote:
>> [snip]
>>
>>> +int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
>>> +{
>>> + int ret = 0;
>>> + struct mem_cgroup *mem;
>>> + if (mem_cgroup_subsys.disabled)
>>> + return ret;
>>> +
>>> + rcu_read_lock();
>>> + mem = rcu_dereference(mm->mem_cgroup);
>>> + css_get(&mem->css);
>>> + rcu_read_unlock();
>>> +
>>> + if (nr_pages > 0) {
>>> + if (res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE)))
>>> + ret = 1;
>>> + } else
>>> + res_counter_uncharge(&mem->as_res, (-nr_pages * PAGE_SIZE));
>> No, please, no. Let's make two calls - mem_cgroup_charge_as and mem_cgroup_uncharge_as.
>>
>> [snip]
>>
>
> Yes, sure :)

Thanks :)

>>> @@ -1117,6 +1117,9 @@ munmap_back:
>>> }
>>> }
>>>
>>> + if (mem_cgroup_update_as(mm, len >> PAGE_SHIFT))
>>> + return -ENOMEM;
>>> +
>> Why not use existintg cap_vm_enough_memory and co?
>>
>
> I thought about it and almost used may_expand_vm(), but there is a slight catch
> there. With cap_vm_enough_memory() or security_vm_enough_memory(), they are
> called after total_vm has been calculated. In our case we need to keep the
> cgroups equivalent of total_vm up to date, and we do this in mem_cgorup_update_as.

So? What prevents us from using these hooks? :)

2008-03-17 12:53:31

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

Pavel Emelyanov wrote:
> Balbir Singh wrote:
>> Pavel Emelyanov wrote:
>>> [snip]
>>>
>>>> +int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
>>>> +{
>>>> + int ret = 0;
>>>> + struct mem_cgroup *mem;
>>>> + if (mem_cgroup_subsys.disabled)
>>>> + return ret;
>>>> +
>>>> + rcu_read_lock();
>>>> + mem = rcu_dereference(mm->mem_cgroup);
>>>> + css_get(&mem->css);
>>>> + rcu_read_unlock();
>>>> +
>>>> + if (nr_pages > 0) {
>>>> + if (res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE)))
>>>> + ret = 1;
>>>> + } else
>>>> + res_counter_uncharge(&mem->as_res, (-nr_pages * PAGE_SIZE));
>>> No, please, no. Let's make two calls - mem_cgroup_charge_as and mem_cgroup_uncharge_as.
>>>
>>> [snip]
>>>
>> Yes, sure :)
>
> Thanks :)
>
>>>> @@ -1117,6 +1117,9 @@ munmap_back:
>>>> }
>>>> }
>>>>
>>>> + if (mem_cgroup_update_as(mm, len >> PAGE_SHIFT))
>>>> + return -ENOMEM;
>>>> +
>>> Why not use existintg cap_vm_enough_memory and co?
>>>
>> I thought about it and almost used may_expand_vm(), but there is a slight catch
>> there. With cap_vm_enough_memory() or security_vm_enough_memory(), they are
>> called after total_vm has been calculated. In our case we need to keep the
>> cgroups equivalent of total_vm up to date, and we do this in mem_cgorup_update_as.
>
> So? What prevents us from using these hooks? :)

1. We need to account total_vm usage of the task anyway. So why have two places,
one for accounting and second for control?
2. These hooks are activated for conditionally invoked for vma's with VM_ACCOUNT
set.


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-17 13:03:21

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

Balbir Singh wrote:
> Pavel Emelyanov wrote:
>> Balbir Singh wrote:
>>> Pavel Emelyanov wrote:
>>>> [snip]
>>>>
>>>>> +int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
>>>>> +{
>>>>> + int ret = 0;
>>>>> + struct mem_cgroup *mem;
>>>>> + if (mem_cgroup_subsys.disabled)
>>>>> + return ret;
>>>>> +
>>>>> + rcu_read_lock();
>>>>> + mem = rcu_dereference(mm->mem_cgroup);
>>>>> + css_get(&mem->css);
>>>>> + rcu_read_unlock();
>>>>> +
>>>>> + if (nr_pages > 0) {
>>>>> + if (res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE)))
>>>>> + ret = 1;
>>>>> + } else
>>>>> + res_counter_uncharge(&mem->as_res, (-nr_pages * PAGE_SIZE));
>>>> No, please, no. Let's make two calls - mem_cgroup_charge_as and mem_cgroup_uncharge_as.
>>>>
>>>> [snip]
>>>>
>>> Yes, sure :)
>> Thanks :)
>>
>>>>> @@ -1117,6 +1117,9 @@ munmap_back:
>>>>> }
>>>>> }
>>>>>
>>>>> + if (mem_cgroup_update_as(mm, len >> PAGE_SHIFT))
>>>>> + return -ENOMEM;
>>>>> +
>>>> Why not use existintg cap_vm_enough_memory and co?
>>>>
>>> I thought about it and almost used may_expand_vm(), but there is a slight catch
>>> there. With cap_vm_enough_memory() or security_vm_enough_memory(), they are
>>> called after total_vm has been calculated. In our case we need to keep the
>>> cgroups equivalent of total_vm up to date, and we do this in mem_cgorup_update_as.
>> So? What prevents us from using these hooks? :)
>
> 1. We need to account total_vm usage of the task anyway. So why have two places,
> one for accounting and second for control?

We still have two of them even placing hooks in each place manually.

Besides, putting the mem_cgroup_(un)charge_as() in these vm hooks will
1. save the number of places to patch
2. help keeping memcgroup consistent in case someone adds more places
that expand tasks vm (arches, drivers) - in case we have our hooks
celled from inside vm ones, we won't have to patch more.

> 2. These hooks are activated for conditionally invoked for vma's with VM_ACCOUNT
> set.

This is a good point against. But, wrt my previous comment, can we handle
this somehow?

2008-03-17 14:41:23

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

Pavel Emelyanov wrote:
> Balbir Singh wrote:
>> Pavel Emelyanov wrote:
>>> Balbir Singh wrote:
>>>> Pavel Emelyanov wrote:
>>>>> [snip]
>>>>>
>>>>>> +int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
>>>>>> +{
>>>>>> + int ret = 0;
>>>>>> + struct mem_cgroup *mem;
>>>>>> + if (mem_cgroup_subsys.disabled)
>>>>>> + return ret;
>>>>>> +
>>>>>> + rcu_read_lock();
>>>>>> + mem = rcu_dereference(mm->mem_cgroup);
>>>>>> + css_get(&mem->css);
>>>>>> + rcu_read_unlock();
>>>>>> +
>>>>>> + if (nr_pages > 0) {
>>>>>> + if (res_counter_charge(&mem->as_res, (nr_pages * PAGE_SIZE)))
>>>>>> + ret = 1;
>>>>>> + } else
>>>>>> + res_counter_uncharge(&mem->as_res, (-nr_pages * PAGE_SIZE));
>>>>> No, please, no. Let's make two calls - mem_cgroup_charge_as and mem_cgroup_uncharge_as.
>>>>>
>>>>> [snip]
>>>>>
>>>> Yes, sure :)
>>> Thanks :)
>>>
>>>>>> @@ -1117,6 +1117,9 @@ munmap_back:
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> + if (mem_cgroup_update_as(mm, len >> PAGE_SHIFT))
>>>>>> + return -ENOMEM;
>>>>>> +
>>>>> Why not use existintg cap_vm_enough_memory and co?
>>>>>
>>>> I thought about it and almost used may_expand_vm(), but there is a slight catch
>>>> there. With cap_vm_enough_memory() or security_vm_enough_memory(), they are
>>>> called after total_vm has been calculated. In our case we need to keep the
>>>> cgroups equivalent of total_vm up to date, and we do this in mem_cgorup_update_as.
>>> So? What prevents us from using these hooks? :)
>> 1. We need to account total_vm usage of the task anyway. So why have two places,
>> one for accounting and second for control?
>
> We still have two of them even placing hooks in each place manually.
>
> Besides, putting the mem_cgroup_(un)charge_as() in these vm hooks will
> 1. save the number of places to patch
> 2. help keeping memcgroup consistent in case someone adds more places
> that expand tasks vm (arches, drivers) - in case we have our hooks
> celled from inside vm ones, we won't have to patch more.
>

I am not sure I understand your proposal. Without manually placing these hooks
how do we track

1. When the vm size has increased/decreased
2. In case due to some reason, the call following these hooks fail, how do we
undo it, without placing hooks?


>> 2. These hooks are activated for conditionally invoked for vma's with VM_ACCOUNT
>> set.
>
> This is a good point against. But, wrt my previous comment, can we handle
> this somehow?

Not sure I understand

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-17 15:17:30

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][0/3] Virtual address space control for cgroups

Paul Menage wrote:
> On Mon, Mar 17, 2008 at 1:08 PM, Balbir Singh <[email protected]> wrote:
>> I understand the per-mm pointer overhead back to the cgroup. I don't understand
>> the part about adding a per-mm pointer back to the "owning" task. We already
>> have task->mm.
>
> Yes, but we don't have mm->owner, which is what I was proposing -
> mm->owner would be a pointer typically to the mm's thread group
> leader. It would remove the need to have to have pointers for the
> various different cgroup subsystems that need to act on an mm rather
> than a task_struct, since then you could use
> mm->owner->cgroups[subsys_id].
>

Aaahh.. Yes.. mm->owner might be a good idea. The only thing we'll need to
handle is when mm->owner dies (I think the thread group is still kept around).
The other disadvantage is the double dereferencing, which should not be all that
bad.

> But this is kind of orthogonal to whether virtual address space limits
> should be a separate cgroup subsystem.
>

Yes, sure.


--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-17 16:53:26

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

On Sun, 2008-03-16 at 23:00 +0530, Balbir Singh wrote:
> @@ -787,6 +788,8 @@ static int ptrace_bts_realloc(struct tas
> current->mm->total_vm -= old_size;
> current->mm->locked_vm -= old_size;
>
> + mem_cgroup_update_as(current->mm, -old_size);
> +
> if (size == 0)
> goto out;

I think splattering these things all over is probably a bad idea.

If you're going to do this, I think you need a couple of phases.

1. update the vm_(un)acct_memory() functions to take an mm
2. start using them (or some other abstracted functions in place)
3. update the new functions for cgroups

It's a bit non-obvious why you do the mem_cgroup_update_as() calls in
the places that you do from context.

Having some other vm-abstracted functions will also keep you from
splattering mem_cgroup_update_as() across the tree. That's a pretty bad
name. :) ...update_mapped() or ...update_vm() might be a wee bit
better.

-- Dave

2008-03-17 23:36:01

by YAMAMOTO Takashi

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

> diff -puN mm/swapfile.c~memory-controller-virtual-address-space-accounting-and-control mm/swapfile.c
> diff -puN mm/memory.c~memory-controller-virtual-address-space-accounting-and-control mm/memory.c
> --- linux-2.6.25-rc5/mm/memory.c~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
> +++ linux-2.6.25-rc5-balbir/mm/memory.c 2008-03-16 22:57:40.000000000 +0530
> @@ -838,6 +838,11 @@ unsigned long unmap_vmas(struct mmu_gath
>
> if (vma->vm_flags & VM_ACCOUNT)
> *nr_accounted += (end - start) >> PAGE_SHIFT;
> + /*
> + * Unaccount used virtual memory for cgroups
> + */
> + mem_cgroup_update_as(vma->vm_mm,
> + ((long)(start - end)) >> PAGE_SHIFT);
>
> while (start != end) {
> if (!tlb_start_valid) {

i think you can sum and uncharge it with a single call.

YAMAMOTO Takashi

2008-03-18 01:12:46

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

YAMAMOTO Takashi wrote:
>> diff -puN mm/swapfile.c~memory-controller-virtual-address-space-accounting-and-control mm/swapfile.c
>> diff -puN mm/memory.c~memory-controller-virtual-address-space-accounting-and-control mm/memory.c
>> --- linux-2.6.25-rc5/mm/memory.c~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
>> +++ linux-2.6.25-rc5-balbir/mm/memory.c 2008-03-16 22:57:40.000000000 +0530
>> @@ -838,6 +838,11 @@ unsigned long unmap_vmas(struct mmu_gath
>>
>> if (vma->vm_flags & VM_ACCOUNT)
>> *nr_accounted += (end - start) >> PAGE_SHIFT;
>> + /*
>> + * Unaccount used virtual memory for cgroups
>> + */
>> + mem_cgroup_update_as(vma->vm_mm,
>> + ((long)(start - end)) >> PAGE_SHIFT);
>>
>> while (start != end) {
>> if (!tlb_start_valid) {
>
> i think you can sum and uncharge it with a single call.
>

Like nr_accounted? I'll have to duplicate nr_accounted since that depends
conditionally on VM_ACCOUNT.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-18 01:17:53

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

Dave Hansen wrote:
> On Sun, 2008-03-16 at 23:00 +0530, Balbir Singh wrote:
>> @@ -787,6 +788,8 @@ static int ptrace_bts_realloc(struct tas
>> current->mm->total_vm -= old_size;
>> current->mm->locked_vm -= old_size;
>>
>> + mem_cgroup_update_as(current->mm, -old_size);
>> +
>> if (size == 0)
>> goto out;
>
> I think splattering these things all over is probably a bad idea.
>

I agree and I tried to avoid the splattering

> If you're going to do this, I think you need a couple of phases.
>
> 1. update the vm_(un)acct_memory() functions to take an mm

There are other problems

1. vm_(un)acct_memory is conditionally dependent on VM_ACCOUNT. Look at
shmem_(un)acct_size for example
2. These routines are not called from all contexts that we care about (look at
insert_special_mapping())

> 2. start using them (or some other abstracted functions in place)
> 3. update the new functions for cgroups
>
> It's a bit non-obvious why you do the mem_cgroup_update_as() calls in
> the places that you do from context.
>
> Having some other vm-abstracted functions will also keep you from
> splattering mem_cgroup_update_as() across the tree. That's a pretty bad
> name. :) ...update_mapped() or ...update_vm() might be a wee bit
> better.
>

I am going to split mem_cgroup_update_as() to two routines with a better name. I
agree with you in principle about splattering, but please see my comments above

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-03-19 19:33:46

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][2/3] Account and control virtual address space allocations

On Tue, 2008-03-18 at 06:44 +0530, Balbir Singh wrote:
> > If you're going to do this, I think you need a couple of phases.
> >
> > 1. update the vm_(un)acct_memory() functions to take an mm
>
> There are other problems
>
> 1. vm_(un)acct_memory is conditionally dependent on VM_ACCOUNT. Look at
> shmem_(un)acct_size for example

Yeah, but if VM_ACCOUNT isn't set, do you really want the controller
accounting for them? It's there for a reason. :)

The shmem_acct_size() helpers look good. I wonder if we should be using
that kind of things more generically.

> 2. These routines are not called from all contexts that we care about (look at
> insert_special_mapping())

Could you explain why "we" care about it and why it isn't accounted for
now?

-- Dave