2014-07-03 12:48:44

by Vladimir Davydov

Subject: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

Hi,

Typically, when a process calls mmap, it isn't given all the memory pages it
requested immediately. Instead, only its address space is grown, while the
memory pages will be actually allocated on the first use. If the system fails
to allocate a page, it will have no choice except invoking the OOM killer,
which may kill this or any other process. Obviously, it isn't the best way of
telling the user that the system is unable to handle his request. It would be
much better to fail mmap with ENOMEM instead.

That's why Linux has the memory overcommit control feature, which accounts and
limits VM size that may contribute to mem+swap, i.e. private writable mappings
and shared memory areas. However, currently it's only available system-wide,
and there's no way of avoiding OOM in cgroups.

This patch set is an attempt to fill the gap. It implements the resource
controller for cgroups that accounts and limits address space allocations that
may contribute to mem+swap.

The interface is similar to the one of the memory cgroup except it controls
virtual memory usage, not actual memory allocation:

vm.usage_in_bytes        current vm usage of processes inside cgroup
                         (read-only)

vm.max_usage_in_bytes    max vm.usage_in_bytes, can be reset by writing 0

vm.limit_in_bytes        vm.usage_in_bytes must be <= vm.limit_in_bytes;
                         allocations that hit the limit will fail
                         with ENOMEM

vm.failcnt               number of times the limit was hit, can be reset
                         by writing 0
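
For illustration, here is roughly how the interface could be driven from
userspace. This is a minimal sketch only: the /sys/fs/cgroup/vm mount point,
the "test" cgroup name and the 128M limit are assumptions made up for the
example, not anything mandated by the patches.

/*
 * Create a vm cgroup, limit it to 128M of chargeable address space,
 * move the current task into it and read back the usage.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int write_file(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fputs(val, f);
        return fclose(f);
}

int main(void)
{
        char pid[16], buf[64];
        FILE *f;

        mkdir("/sys/fs/cgroup/vm/test", 0755);

        /* private writable mappings + shmem capped at 128M */
        write_file("/sys/fs/cgroup/vm/test/vm.limit_in_bytes", "134217728");

        /* move ourselves into the new cgroup */
        snprintf(pid, sizeof(pid), "%d", getpid());
        write_file("/sys/fs/cgroup/vm/test/cgroup.procs", pid);

        /* further accountable mmaps beyond the limit now fail with ENOMEM */
        f = fopen("/sys/fs/cgroup/vm/test/vm.usage_in_bytes", "r");
        if (f && fgets(buf, sizeof(buf), f))
                printf("vm.usage_in_bytes: %s", buf);
        if (f)
                fclose(f);
        return 0;
}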

In future, the controller can be easily extended to account for locked pages
and shmem.

Note, for the sake of simplicity, task migrations and mm->owner changes are not
handled yet. I'm planning to fix this in the next version if the need for this
cgroup is confirmed.

This isn't the first attempt to introduce VM accounting per cgroup. Several
years ago Balbir Singh almost pushed his memrlimit cgroup, but it was finally
shelved (see http://lwn.net/Articles/283287/). Balbir's cgroup had one
principal difference from the vm cgroup I'm presenting here: it limited the
sum of mm->total_vm of tasks inside a cgroup, i.e. it worked like an
RLIMIT_AS, but for the whole cgroup. IMO, that isn't very useful, because
shared memory areas are accounted more than once (e.g. a 100M shmem segment
mapped by ten tasks counts as 1G), which can lead to mmap failing even if
there's plenty of free memory and OOM is impossible.

Any comments are highly appreciated.

Thanks,

Vladimir Davydov (5):
vm_cgroup: basic infrastructure
vm_cgroup: private writable mappings accounting
shmem: pass inode to shmem_acct_* methods
vm_cgroup: shared memory accounting
vm_cgroup: do not charge tasks in root cgroup

include/linux/cgroup_subsys.h | 4 +
include/linux/mm_types.h | 3 +
include/linux/shmem_fs.h | 6 +
include/linux/vm_cgroup.h | 79 +++++++++++++
init/Kconfig | 4 +
kernel/fork.c | 12 +-
mm/Makefile | 1 +
mm/mmap.c | 43 ++++++--
mm/mprotect.c | 8 +-
mm/mremap.c | 15 ++-
mm/shmem.c | 94 +++++++++++-----
mm/vm_cgroup.c | 244 +++++++++++++++++++++++++++++++++++++++++
12 files changed, 471 insertions(+), 42 deletions(-)
create mode 100644 include/linux/vm_cgroup.h
create mode 100644 mm/vm_cgroup.c

--
1.7.10.4


2014-07-03 12:49:08

by Vladimir Davydov

Subject: [PATCH RFC 1/5] vm_cgroup: basic infrastructure

This patch introduces the vm cgroup to control address space expansion
of tasks that belong to a cgroup. The idea is to provide a mechanism to
limit memory overcommit not only for the whole system, but also on a
per-cgroup basis.

This patch only adds some basic cgroup methods, like alloc/free and
write/read, while the real accounting/limiting is done in the following
patches.

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/cgroup_subsys.h | 4 ++
include/linux/vm_cgroup.h | 18 ++++++
init/Kconfig | 4 ++
mm/Makefile | 1 +
mm/vm_cgroup.c | 131 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 158 insertions(+)
create mode 100644 include/linux/vm_cgroup.h
create mode 100644 mm/vm_cgroup.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 98c4f9b12b03..8eb7db12f6ea 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
SUBSYS(hugetlb)
#endif

+#if IS_ENABLED(CONFIG_CGROUP_VM)
+SUBSYS(vm)
+#endif
+
/*
* The following subsystems are not supported on the default hierarchy.
*/
diff --git a/include/linux/vm_cgroup.h b/include/linux/vm_cgroup.h
new file mode 100644
index 000000000000..b629c9affa4b
--- /dev/null
+++ b/include/linux/vm_cgroup.h
@@ -0,0 +1,18 @@
+#ifndef _LINUX_VM_CGROUP_H
+#define _LINUX_VM_CGROUP_H
+
+#ifdef CONFIG_CGROUP_VM
+static inline bool vm_cgroup_disabled(void)
+{
+ if (vm_cgrp_subsys.disabled)
+ return true;
+ return false;
+}
+#else /* !CONFIG_CGROUP_VM */
+static inline bool vm_cgroup_disabled(void)
+{
+ return true;
+}
+#endif /* CONFIG_CGROUP_VM */
+
+#endif /* _LINUX_VM_CGROUP_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99af1b9..4419835bea7c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1008,6 +1008,10 @@ config MEMCG_KMEM
unusable in real life so DO NOT SELECT IT unless for development
purposes.

+config CGROUP_VM
+ bool "Virtual Memory Resource Controller for Control Groups"
+ default n
+
config CGROUP_HUGETLB
bool "HugeTLB Resource Controller for Control Groups"
depends on RESOURCE_COUNTERS && HUGETLB_PAGE
diff --git a/mm/Makefile b/mm/Makefile
index 4064f3ec145e..914520d2669f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o vmpressure.o
+obj-$(CONFIG_CGROUP_VM) += vm_cgroup.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/vm_cgroup.c b/mm/vm_cgroup.c
new file mode 100644
index 000000000000..7f5b81482748
--- /dev/null
+++ b/mm/vm_cgroup.c
@@ -0,0 +1,131 @@
+#include <linux/cgroup.h>
+#include <linux/res_counter.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/vm_cgroup.h>
+
+struct vm_cgroup {
+ struct cgroup_subsys_state css;
+
+ /*
+ * The counter to account for vm usage.
+ */
+ struct res_counter res;
+};
+
+static struct vm_cgroup *root_vm_cgroup __read_mostly;
+
+static inline bool vm_cgroup_is_root(struct vm_cgroup *vmcg)
+{
+ return vmcg == root_vm_cgroup;
+}
+
+static struct vm_cgroup *vm_cgroup_from_css(struct cgroup_subsys_state *s)
+{
+ return s ? container_of(s, struct vm_cgroup, css) : NULL;
+}
+
+static struct cgroup_subsys_state *
+vm_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct vm_cgroup *parent = vm_cgroup_from_css(parent_css);
+ struct vm_cgroup *vmcg;
+
+ vmcg = kzalloc(sizeof(*vmcg), GFP_KERNEL);
+ if (!vmcg)
+ return ERR_PTR(-ENOMEM);
+
+ res_counter_init(&vmcg->res, parent ? &parent->res : NULL);
+
+ if (!parent)
+ root_vm_cgroup = vmcg;
+
+ return &vmcg->css;
+}
+
+static void vm_cgroup_css_free(struct cgroup_subsys_state *css)
+{
+ struct vm_cgroup *vmcg = vm_cgroup_from_css(css);
+
+ kfree(vmcg);
+}
+
+static u64 vm_cgroup_read_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct vm_cgroup *vmcg = vm_cgroup_from_css(css);
+ int memb = cft->private;
+
+ return res_counter_read_u64(&vmcg->res, memb);
+}
+
+static ssize_t vm_cgroup_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct vm_cgroup *vmcg = vm_cgroup_from_css(of_css(of));
+ unsigned long long val;
+ int ret;
+
+ if (vm_cgroup_is_root(vmcg))
+ return -EINVAL;
+
+ buf = strstrip(buf);
+ ret = res_counter_memparse_write_strategy(buf, &val);
+ if (ret)
+ return ret;
+
+ ret = res_counter_set_limit(&vmcg->res, val);
+ return ret ?: nbytes;
+}
+
+static ssize_t vm_cgroup_reset(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct vm_cgroup *vmcg = vm_cgroup_from_css(of_css(of));
+ int memb = of_cft(of)->private;
+
+ switch (memb) {
+ case RES_MAX_USAGE:
+ res_counter_reset_max(&vmcg->res);
+ break;
+ case RES_FAILCNT:
+ res_counter_reset_failcnt(&vmcg->res);
+ break;
+ default:
+ BUG();
+ }
+ return nbytes;
+}
+
+static struct cftype vm_cgroup_files[] = {
+ {
+ .name = "usage_in_bytes",
+ .private = RES_USAGE,
+ .read_u64 = vm_cgroup_read_u64,
+ },
+ {
+ .name = "max_usage_in_bytes",
+ .private = RES_MAX_USAGE,
+ .write = vm_cgroup_reset,
+ .read_u64 = vm_cgroup_read_u64,
+ },
+ {
+ .name = "limit_in_bytes",
+ .private = RES_LIMIT,
+ .write = vm_cgroup_write,
+ .read_u64 = vm_cgroup_read_u64,
+ },
+ {
+ .name = "failcnt",
+ .private = RES_FAILCNT,
+ .write = vm_cgroup_reset,
+ .read_u64 = vm_cgroup_read_u64,
+ },
+ { }, /* terminate */
+};
+
+struct cgroup_subsys vm_cgrp_subsys = {
+ .css_alloc = vm_cgroup_css_alloc,
+ .css_free = vm_cgroup_css_free,
+ .base_cftypes = vm_cgroup_files,
+};
--
1.7.10.4

2014-07-03 12:49:20

by Vladimir Davydov

Subject: [PATCH RFC 2/5] vm_cgroup: private writable mappings accounting

Address space that contributes to memory overcommit consists of two
parts - private writable mappings and shared memory. This patch adds
private writable mappings accounting.

The implementation is quite simple. Each mm holds a reference to the vm
cgroup it is accounted to. The reference is initialized with the current
cgroup on mm creation and released only on mm destruction. For
simplicity, task migrations as well as mm owner changes are not handled
yet, so an offline cgroup will be pinned in memory until all mm's
accounted to it die.

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/mm_types.h | 3 ++
include/linux/vm_cgroup.h | 29 ++++++++++++++++++++
kernel/fork.c | 12 +++++++-
mm/mmap.c | 43 +++++++++++++++++++++++------
mm/mprotect.c | 8 +++++-
mm/mremap.c | 15 ++++++++--
mm/vm_cgroup.c | 67 +++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 164 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 96c5750e3110..ae6c23524b8a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -419,6 +419,9 @@ struct mm_struct {
*/
struct task_struct __rcu *owner;
#endif
+#ifdef CONFIG_CGROUP_VM
+ struct vm_cgroup *vmcg; /* vm_cgroup this mm is accounted to */
+#endif

/* store ref to file /proc/<pid>/exe symlink points to */
struct file *exe_file;
diff --git a/include/linux/vm_cgroup.h b/include/linux/vm_cgroup.h
index b629c9affa4b..34ed936a0a10 100644
--- a/include/linux/vm_cgroup.h
+++ b/include/linux/vm_cgroup.h
@@ -1,6 +1,8 @@
#ifndef _LINUX_VM_CGROUP_H
#define _LINUX_VM_CGROUP_H

+struct mm_struct;
+
#ifdef CONFIG_CGROUP_VM
static inline bool vm_cgroup_disabled(void)
{
@@ -8,11 +10,38 @@ static inline bool vm_cgroup_disabled(void)
return true;
return false;
}
+
+extern void mm_init_vm_cgroup(struct mm_struct *mm, struct task_struct *p);
+extern void mm_release_vm_cgroup(struct mm_struct *mm);
+extern int vm_cgroup_charge_memory_mm(struct mm_struct *mm,
+ unsigned long nr_pages);
+extern void vm_cgroup_uncharge_memory_mm(struct mm_struct *mm,
+ unsigned long nr_pages);
#else /* !CONFIG_CGROUP_VM */
static inline bool vm_cgroup_disabled(void)
{
return true;
}
+
+static inline void mm_init_vm_cgroup(struct mm_struct *mm,
+ struct task_struct *p)
+{
+}
+
+static inline void mm_release_vm_cgroup(struct mm_struct *mm)
+{
+}
+
+static inline int vm_cgroup_charge_memory_mm(struct mm_struct *mm,
+ unsigned long nr_pages)
+{
+ return 0;
+}
+
+static inline void vm_cgroup_uncharge_memory_mm(struct mm_struct *mm,
+ unsigned long nr_pages)
+{
+}
#endif /* CONFIG_CGROUP_VM */

#endif /* _LINUX_VM_CGROUP_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index d2799d1fc952..8f96553f9fde 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -74,6 +74,7 @@
#include <linux/uprobes.h>
#include <linux/aio.h>
#include <linux/compiler.h>
+#include <linux/vm_cgroup.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -394,8 +395,13 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
if (mpnt->vm_flags & VM_ACCOUNT) {
unsigned long len = vma_pages(mpnt);

- if (security_vm_enough_memory_mm(oldmm, len)) /* sic */
+ if (vm_cgroup_charge_memory_mm(mm, len))
goto fail_nomem;
+
+ if (security_vm_enough_memory_mm(oldmm, len)) {
+ vm_cgroup_uncharge_memory_mm(mm, len);
+ goto fail_nomem;
+ }
charge = len;
}
tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
@@ -479,6 +485,7 @@ fail_nomem_policy:
fail_nomem:
retval = -ENOMEM;
vm_unacct_memory(charge);
+ vm_cgroup_uncharge_memory_mm(mm, charge);
goto out;
}

@@ -551,6 +558,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)

if (likely(!mm_alloc_pgd(mm))) {
mmu_notifier_mm_init(mm);
+ mm_init_vm_cgroup(mm, current);
return mm;
}

@@ -599,6 +607,7 @@ struct mm_struct *mm_alloc(void)
void __mmdrop(struct mm_struct *mm)
{
BUG_ON(mm == &init_mm);
+ mm_release_vm_cgroup(mm);
mm_free_pgd(mm);
destroy_context(mm);
mmu_notifier_mm_destroy(mm);
@@ -857,6 +866,7 @@ fail_nocontext:
* If init_new_context() failed, we cannot use mmput() to free the mm
* because it calls destroy_context()
*/
+ mm_release_vm_cgroup(mm);
mm_free_pgd(mm);
free_mm(mm);
return NULL;
diff --git a/mm/mmap.c b/mm/mmap.c
index 129b847d30cc..9ba9e932e132 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -40,6 +40,7 @@
#include <linux/notifier.h>
#include <linux/memory.h>
#include <linux/printk.h>
+#include <linux/vm_cgroup.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -1535,8 +1536,12 @@ munmap_back:
*/
if (accountable_mapping(file, vm_flags)) {
charged = len >> PAGE_SHIFT;
- if (security_vm_enough_memory_mm(mm, charged))
+ if (vm_cgroup_charge_memory_mm(mm, charged))
return -ENOMEM;
+ if (security_vm_enough_memory_mm(mm, charged)) {
+ vm_cgroup_uncharge_memory_mm(mm, charged);
+ return -ENOMEM;
+ }
vm_flags |= VM_ACCOUNT;
}

@@ -1652,8 +1657,10 @@ unmap_and_free_vma:
free_vma:
kmem_cache_free(vm_area_cachep, vma);
unacct_error:
- if (charged)
+ if (charged) {
vm_unacct_memory(charged);
+ vm_cgroup_uncharge_memory_mm(mm, charged);
+ }
return error;
}

@@ -2084,12 +2091,16 @@ static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, uns
if (is_hugepage_only_range(vma->vm_mm, new_start, size))
return -EFAULT;

+ if (vm_cgroup_charge_memory_mm(mm, grow))
+ return -ENOMEM;
/*
* Overcommit.. This must be the final test, as it will
* update security statistics.
*/
- if (security_vm_enough_memory_mm(mm, grow))
+ if (security_vm_enough_memory_mm(mm, grow)) {
+ vm_cgroup_uncharge_memory_mm(mm, grow);
return -ENOMEM;
+ }

/* Ok, everything looks good - let it rip */
if (vma->vm_flags & VM_LOCKED)
@@ -2341,6 +2352,7 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
vma = remove_vma(vma);
} while (vma);
vm_unacct_memory(nr_accounted);
+ vm_cgroup_uncharge_memory_mm(mm, nr_accounted);
validate_mm(mm);
}

@@ -2603,6 +2615,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
unsigned long flags;
struct rb_node ** rb_link, * rb_parent;
pgoff_t pgoff = addr >> PAGE_SHIFT;
+ unsigned long charged;
int error;

len = PAGE_ALIGN(len);
@@ -2642,8 +2655,13 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;

- if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
+ charged = len >> PAGE_SHIFT;
+ if (vm_cgroup_charge_memory_mm(mm, charged))
+ return -ENOMEM;
+ if (security_vm_enough_memory_mm(mm, charged)) {
+ vm_cgroup_uncharge_memory_mm(mm, charged);
return -ENOMEM;
+ }

/* Can we just expand an old private anonymous mapping? */
vma = vma_merge(mm, prev, addr, addr + len, flags,
@@ -2656,7 +2674,8 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
*/
vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
if (!vma) {
- vm_unacct_memory(len >> PAGE_SHIFT);
+ vm_unacct_memory(charged);
+ vm_cgroup_uncharge_memory_mm(mm, charged);
return -ENOMEM;
}

@@ -2738,6 +2757,7 @@ void exit_mmap(struct mm_struct *mm)
vma = remove_vma(vma);
}
vm_unacct_memory(nr_accounted);
+ vm_cgroup_uncharge_memory_mm(mm, nr_accounted);

WARN_ON(atomic_long_read(&mm->nr_ptes) >
(FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
@@ -2771,9 +2791,16 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
if (find_vma_links(mm, vma->vm_start, vma->vm_end,
&prev, &rb_link, &rb_parent))
return -ENOMEM;
- if ((vma->vm_flags & VM_ACCOUNT) &&
- security_vm_enough_memory_mm(mm, vma_pages(vma)))
- return -ENOMEM;
+ if ((vma->vm_flags & VM_ACCOUNT)) {
+ unsigned long charged = vma_pages(vma);
+
+ if (vm_cgroup_charge_memory_mm(mm, charged))
+ return -ENOMEM;
+ if (security_vm_enough_memory_mm(mm, charged)) {
+ vm_cgroup_uncharge_memory_mm(mm, charged);
+ return -ENOMEM;
+ }
+ }

vma_link(mm, vma, prev, rb_link, rb_parent);
return 0;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557941f8..f76d1cadb3c1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -24,6 +24,7 @@
#include <linux/migrate.h>
#include <linux/perf_event.h>
#include <linux/ksm.h>
+#include <linux/vm_cgroup.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
@@ -283,8 +284,12 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
VM_SHARED|VM_NORESERVE))) {
charged = nrpages;
- if (security_vm_enough_memory_mm(mm, charged))
+ if (vm_cgroup_charge_memory_mm(mm, charged))
return -ENOMEM;
+ if (security_vm_enough_memory_mm(mm, charged)) {
+ vm_cgroup_uncharge_memory_mm(mm, charged);
+ return -ENOMEM;
+ }
newflags |= VM_ACCOUNT;
}
}
@@ -338,6 +343,7 @@ success:

fail:
vm_unacct_memory(charged);
+ vm_cgroup_uncharge_memory_mm(mm, charged);
return error;
}

diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180e9f21..1cf5709acce5 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -21,6 +21,7 @@
#include <linux/syscalls.h>
#include <linux/mmu_notifier.h>
#include <linux/sched/sysctl.h>
+#include <linux/vm_cgroup.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -313,6 +314,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (do_munmap(mm, old_addr, old_len) < 0) {
/* OOM: unable to split vma, just get accounts right */
vm_unacct_memory(excess >> PAGE_SHIFT);
+ vm_cgroup_uncharge_memory_mm(mm, excess >> PAGE_SHIFT);
excess = 0;
}
mm->hiwater_vm = hiwater_vm;
@@ -374,8 +376,13 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,

if (vma->vm_flags & VM_ACCOUNT) {
unsigned long charged = (new_len - old_len) >> PAGE_SHIFT;
- if (security_vm_enough_memory_mm(mm, charged))
+
+ if (vm_cgroup_charge_memory_mm(mm, charged))
+ goto Efault;
+ if (security_vm_enough_memory_mm(mm, charged)) {
+ vm_cgroup_uncharge_memory_mm(mm, charged);
goto Efault;
+ }
*p = charged;
}

@@ -447,7 +454,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
goto out;
out1:
vm_unacct_memory(charged);
-
+ vm_cgroup_uncharge_memory_mm(mm, charged);
out:
return ret;
}
@@ -578,8 +585,10 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
}
out:
- if (ret & ~PAGE_MASK)
+ if (ret & ~PAGE_MASK) {
vm_unacct_memory(charged);
+ vm_cgroup_uncharge_memory_mm(mm, charged);
+ }
up_write(&current->mm->mmap_sem);
if (locked && new_len > old_len)
mm_populate(new_addr + old_len, new_len - old_len);
diff --git a/mm/vm_cgroup.c b/mm/vm_cgroup.c
index 7f5b81482748..4dd693b34e33 100644
--- a/mm/vm_cgroup.c
+++ b/mm/vm_cgroup.c
@@ -2,6 +2,7 @@
#include <linux/res_counter.h>
#include <linux/mm.h>
#include <linux/slab.h>
+#include <linux/rcupdate.h>
#include <linux/vm_cgroup.h>

struct vm_cgroup {
@@ -25,6 +26,72 @@ static struct vm_cgroup *vm_cgroup_from_css(struct cgroup_subsys_state *s)
return s ? container_of(s, struct vm_cgroup, css) : NULL;
}

+static struct vm_cgroup *vm_cgroup_from_task(struct task_struct *p)
+{
+ return vm_cgroup_from_css(task_css(p, vm_cgrp_id));
+}
+
+static struct vm_cgroup *get_vm_cgroup_from_task(struct task_struct *p)
+{
+ struct vm_cgroup *vmcg;
+
+ rcu_read_lock();
+ do {
+ vmcg = vm_cgroup_from_task(p);
+ } while (!css_tryget_online(&vmcg->css));
+ rcu_read_unlock();
+
+ return vmcg;
+}
+
+void mm_init_vm_cgroup(struct mm_struct *mm, struct task_struct *p)
+{
+ if (!vm_cgroup_disabled())
+ mm->vmcg = get_vm_cgroup_from_task(p);
+}
+
+void mm_release_vm_cgroup(struct mm_struct *mm)
+{
+ struct vm_cgroup *vmcg = mm->vmcg;
+
+ if (vmcg)
+ css_put(&vmcg->css);
+}
+
+static int vm_cgroup_do_charge(struct vm_cgroup *vmcg,
+ unsigned long nr_pages)
+{
+ unsigned long val = nr_pages << PAGE_SHIFT;
+ struct res_counter *fail_res;
+
+ return res_counter_charge(&vmcg->res, val, &fail_res);
+}
+
+static void vm_cgroup_do_uncharge(struct vm_cgroup *vmcg,
+ unsigned long nr_pages)
+{
+ unsigned long val = nr_pages << PAGE_SHIFT;
+
+ res_counter_uncharge(&vmcg->res, val);
+}
+
+int vm_cgroup_charge_memory_mm(struct mm_struct *mm, unsigned long nr_pages)
+{
+ struct vm_cgroup *vmcg = mm->vmcg;
+
+ if (vmcg)
+ return vm_cgroup_do_charge(vmcg, nr_pages);
+ return 0;
+}
+
+void vm_cgroup_uncharge_memory_mm(struct mm_struct *mm, unsigned long nr_pages)
+{
+ struct vm_cgroup *vmcg = mm->vmcg;
+
+ if (vmcg)
+ vm_cgroup_do_uncharge(vmcg, nr_pages);
+}
+
static struct cgroup_subsys_state *
vm_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
--
1.7.10.4

2014-07-03 12:49:27

by Vladimir Davydov

Subject: [PATCH RFC 3/5] shmem: pass inode to shmem_acct_* methods

This will be used by the next patch.

Signed-off-by: Vladimir Davydov <[email protected]>
---
mm/shmem.c | 59 ++++++++++++++++++++++++++++++++++-------------------------
1 file changed, 34 insertions(+), 25 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index f484c276e994..4e28eb1222cd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -136,16 +136,20 @@ static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
* (unless MAP_NORESERVE and sysctl_overcommit_memory <= 1),
* consistent with the pre-accounting of private mappings ...
*/
-static inline int shmem_acct_size(unsigned long flags, loff_t size)
+static inline int shmem_acct_size(struct inode *inode)
{
- return (flags & VM_NORESERVE) ?
- 0 : security_vm_enough_memory_mm(current->mm, VM_ACCT(size));
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ return (info->flags & VM_NORESERVE) ?
+ 0 : security_vm_enough_memory_mm(current->mm, VM_ACCT(inode->i_size));
}

-static inline void shmem_unacct_size(unsigned long flags, loff_t size)
+static inline void shmem_unacct_size(struct inode *inode)
{
- if (!(flags & VM_NORESERVE))
- vm_unacct_memory(VM_ACCT(size));
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (!(info->flags & VM_NORESERVE))
+ vm_unacct_memory(VM_ACCT(inode->i_size));
}

/*
@@ -154,15 +158,19 @@ static inline void shmem_unacct_size(unsigned long flags, loff_t size)
* shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
* so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
*/
-static inline int shmem_acct_block(unsigned long flags)
+static inline int shmem_acct_block(struct inode *inode)
{
- return (flags & VM_NORESERVE) ?
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ return (info->flags & VM_NORESERVE) ?
security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0;
}

-static inline void shmem_unacct_blocks(unsigned long flags, long pages)
+static inline void shmem_unacct_blocks(struct inode *inode, long pages)
{
- if (flags & VM_NORESERVE)
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (info->flags & VM_NORESERVE)
vm_unacct_memory(pages * VM_ACCT(PAGE_CACHE_SIZE));
}

@@ -231,7 +239,7 @@ static void shmem_recalc_inode(struct inode *inode)
percpu_counter_add(&sbinfo->used_blocks, -freed);
info->alloced -= freed;
inode->i_blocks -= freed * BLOCKS_PER_PAGE;
- shmem_unacct_blocks(info->flags, freed);
+ shmem_unacct_blocks(inode, freed);
}
}

@@ -565,7 +573,7 @@ static void shmem_evict_inode(struct inode *inode)
struct shmem_inode_info *info = SHMEM_I(inode);

if (inode->i_mapping->a_ops == &shmem_aops) {
- shmem_unacct_size(info->flags, inode->i_size);
+ shmem_unacct_size(inode);
inode->i_size = 0;
shmem_truncate_range(inode, 0, (loff_t)-1);
if (!list_empty(&info->swaplist)) {
@@ -1113,7 +1121,7 @@ repeat:
swap_free(swap);

} else {
- if (shmem_acct_block(info->flags)) {
+ if (shmem_acct_block(inode)) {
error = -ENOSPC;
goto failed;
}
@@ -1205,7 +1213,7 @@ decused:
if (sbinfo->max_blocks)
percpu_counter_add(&sbinfo->used_blocks, -1);
unacct:
- shmem_unacct_blocks(info->flags, 1);
+ shmem_unacct_blocks(inode, 1);
failed:
if (swap.val && error != -EINVAL &&
!shmem_confirm_swap(mapping, index, swap))
@@ -2810,8 +2818,8 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
#define shmem_vm_ops generic_file_vm_ops
#define shmem_file_operations ramfs_file_operations
#define shmem_get_inode(sb, dir, mode, dev, flags) ramfs_get_inode(sb, dir, mode, dev)
-#define shmem_acct_size(flags, size) 0
-#define shmem_unacct_size(flags, size) do {} while (0)
+#define shmem_acct_size(inode) 0
+#define shmem_unacct_size(inode) do {} while (0)

#endif /* CONFIG_SHMEM */

@@ -2836,17 +2844,13 @@ static struct file *__shmem_file_setup(const char *name, loff_t size,
if (size < 0 || size > MAX_LFS_FILESIZE)
return ERR_PTR(-EINVAL);

- if (shmem_acct_size(flags, size))
- return ERR_PTR(-ENOMEM);
-
- res = ERR_PTR(-ENOMEM);
this.name = name;
this.len = strlen(name);
this.hash = 0; /* will go */
sb = shm_mnt->mnt_sb;
path.dentry = d_alloc_pseudo(sb, &this);
if (!path.dentry)
- goto put_memory;
+ return ERR_PTR(-ENOMEM);
d_set_d_op(path.dentry, &anon_ops);
path.mnt = mntget(shm_mnt);

@@ -2859,21 +2863,26 @@ static struct file *__shmem_file_setup(const char *name, loff_t size,
d_instantiate(path.dentry, inode);
inode->i_size = size;
clear_nlink(inode); /* It is unlinked */
+
+ res = ERR_PTR(-ENOMEM);
+ if (shmem_acct_size(inode))
+ goto put_dentry;
+
res = ERR_PTR(ramfs_nommu_expand_for_mapping(inode, size));
if (IS_ERR(res))
- goto put_dentry;
+ goto put_memory;

res = alloc_file(&path, FMODE_WRITE | FMODE_READ,
&shmem_file_operations);
if (IS_ERR(res))
- goto put_dentry;
+ goto put_memory;

return res;

+put_memory:
+ shmem_unacct_size(inode);
put_dentry:
path_put(&path);
-put_memory:
- shmem_unacct_size(flags, size);
return res;
}

--
1.7.10.4

2014-07-03 12:49:36

by Vladimir Davydov

Subject: [PATCH RFC 4/5] vm_cgroup: shared memory accounting

Address space that contributes to memory overcommit consists of two
parts - private writable mappings and shared memory. This patch adds
shared memory accounting.

Each shmem inode holds a reference to the vm cgroup it is accounted to.
The reference is initialized with the current cgroup on shmem inode
creation and released only on shmem inode destruction. For simplicity,
shmem inodes accounted to a vm cgroup are not re-charged to the parent
on css offline yet, so an offline cgroup will be hanging in memory until
all inodes accounted to it die.

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/shmem_fs.h | 6 ++++++
include/linux/vm_cgroup.h | 32 ++++++++++++++++++++++++++++++++
mm/shmem.c | 39 +++++++++++++++++++++++++++++++++------
mm/vm_cgroup.c | 36 ++++++++++++++++++++++++++++++++++++
4 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 4d1771c2d29f..b87f8b35ad40 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -7,6 +7,8 @@
#include <linux/percpu_counter.h>
#include <linux/xattr.h>

+struct vm_cgroup;
+
/* inode in-kernel data */

struct shmem_inode_info {
@@ -21,6 +23,10 @@ struct shmem_inode_info {
struct list_head swaplist; /* chain of maybes on swap */
struct simple_xattrs xattrs; /* list of xattrs */
struct inode vfs_inode;
+#ifdef CONFIG_CGROUP_VM
+ struct vm_cgroup *vmcg; /* vm_cgroup this inode is
+ accounted to */
+#endif
};

struct shmem_sb_info {
diff --git a/include/linux/vm_cgroup.h b/include/linux/vm_cgroup.h
index 34ed936a0a10..582f051091bd 100644
--- a/include/linux/vm_cgroup.h
+++ b/include/linux/vm_cgroup.h
@@ -2,6 +2,7 @@
#define _LINUX_VM_CGROUP_H

struct mm_struct;
+struct shmem_inode_info;

#ifdef CONFIG_CGROUP_VM
static inline bool vm_cgroup_disabled(void)
@@ -17,6 +18,16 @@ extern int vm_cgroup_charge_memory_mm(struct mm_struct *mm,
unsigned long nr_pages);
extern void vm_cgroup_uncharge_memory_mm(struct mm_struct *mm,
unsigned long nr_pages);
+
+#ifdef CONFIG_SHMEM
+extern void shmem_init_vm_cgroup(struct shmem_inode_info *info);
+extern void shmem_release_vm_cgroup(struct shmem_inode_info *info);
+extern int vm_cgroup_charge_shmem(struct shmem_inode_info *info,
+ unsigned long nr_pages);
+extern void vm_cgroup_uncharge_shmem(struct shmem_inode_info *info,
+ unsigned long nr_pages);
+#endif
+
#else /* !CONFIG_CGROUP_VM */
static inline bool vm_cgroup_disabled(void)
{
@@ -42,6 +53,27 @@ static inline void vm_cgroup_uncharge_memory_mm(struct mm_struct *mm,
unsigned long nr_pages)
{
}
+
+#ifdef CONFIG_SHMEM
+static inline void shmem_init_vm_cgroup(struct shmem_inode_info *info)
+{
+}
+
+static inline void shmem_release_vm_cgroup(struct shmem_inode_info *info)
+{
+}
+
+static inline int vm_cgroup_charge_shmem(struct shmem_inode_info *info,
+ unsigned long nr_pages)
+{
+ return 0;
+}
+
+static inline void vm_cgroup_uncharge_shmem(struct shmem_inode_info *info,
+ unsigned long nr_pages)
+{
+}
+#endif
+
#endif /* CONFIG_CGROUP_VM */

#endif /* _LINUX_VM_CGROUP_H */
diff --git a/mm/shmem.c b/mm/shmem.c
index 4e28eb1222cd..3968cdf1d254 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
#include <linux/highmem.h>
#include <linux/seq_file.h>
#include <linux/magic.h>
+#include <linux/vm_cgroup.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -130,6 +131,27 @@ static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
return sb->s_fs_info;
}

+static inline int __shmem_acct_memory(struct shmem_inode_info *info,
+ long pages)
+{
+ int ret;
+
+ ret = vm_cgroup_charge_shmem(info, pages);
+ if (ret)
+ return ret;
+ ret = security_vm_enough_memory_mm(current->mm, pages);
+ if (ret)
+ vm_cgroup_uncharge_shmem(info, pages);
+ return ret;
+}
+
+static inline void __shmem_unacct_memory(struct shmem_inode_info *info,
+ long pages)
+{
+ vm_unacct_memory(pages);
+ vm_cgroup_uncharge_shmem(info, pages);
+}
+
/*
* shmem_file_setup pre-accounts the whole fixed size of a VM object,
* for shared memory and for shared anonymous (/dev/zero) mappings
@@ -141,7 +163,7 @@ static inline int shmem_acct_size(struct inode *inode)
struct shmem_inode_info *info = SHMEM_I(inode);

return (info->flags & VM_NORESERVE) ?
- 0 : security_vm_enough_memory_mm(current->mm, VM_ACCT(inode->i_size));
+ 0 : __shmem_acct_memory(info, VM_ACCT(inode->i_size));
}

static inline void shmem_unacct_size(struct inode *inode)
@@ -149,7 +171,7 @@ static inline void shmem_unacct_size(struct inode *inode)
struct shmem_inode_info *info = SHMEM_I(inode);

if (!(info->flags & VM_NORESERVE))
- vm_unacct_memory(VM_ACCT(inode->i_size));
+ __shmem_unacct_memory(info, VM_ACCT(inode->i_size));
}

/*
@@ -163,7 +185,7 @@ static inline int shmem_acct_block(struct inode *inode)
struct shmem_inode_info *info = SHMEM_I(inode);

return (info->flags & VM_NORESERVE) ?
- security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0;
+ __shmem_acct_memory(info, VM_ACCT(PAGE_CACHE_SIZE)) : 0;
}

static inline void shmem_unacct_blocks(struct inode *inode, long pages)
@@ -171,7 +193,7 @@ static inline void shmem_unacct_blocks(struct inode *inode, long pages)
struct shmem_inode_info *info = SHMEM_I(inode);

if (info->flags & VM_NORESERVE)
- vm_unacct_memory(pages * VM_ACCT(PAGE_CACHE_SIZE));
+ __shmem_unacct_memory(info, pages * VM_ACCT(PAGE_CACHE_SIZE));
}

static const struct super_operations shmem_ops;
@@ -1339,6 +1361,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
inode->i_fop = &shmem_file_operations;
mpol_shared_policy_init(&info->policy,
shmem_get_sbmpol(sbinfo));
+ shmem_init_vm_cgroup(info);
break;
case S_IFDIR:
inc_nlink(inode);
@@ -2590,8 +2613,12 @@ static void shmem_destroy_callback(struct rcu_head *head)

static void shmem_destroy_inode(struct inode *inode)
{
- if (S_ISREG(inode->i_mode))
- mpol_free_shared_policy(&SHMEM_I(inode)->policy);
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if (S_ISREG(inode->i_mode)) {
+ mpol_free_shared_policy(&info->policy);
+ shmem_release_vm_cgroup(info);
+ }
call_rcu(&inode->i_rcu, shmem_destroy_callback);
}

diff --git a/mm/vm_cgroup.c b/mm/vm_cgroup.c
index 4dd693b34e33..6642f934540a 100644
--- a/mm/vm_cgroup.c
+++ b/mm/vm_cgroup.c
@@ -3,6 +3,7 @@
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>
+#include <linux/shmem_fs.h>
#include <linux/vm_cgroup.h>

struct vm_cgroup {
@@ -92,6 +93,41 @@ void vm_cgroup_uncharge_memory_mm(struct mm_struct *mm, unsigned long nr_pages)
vm_cgroup_do_uncharge(vmcg, nr_pages);
}

+#ifdef CONFIG_SHMEM
+void shmem_init_vm_cgroup(struct shmem_inode_info *info)
+{
+ if (!vm_cgroup_disabled())
+ info->vmcg = get_vm_cgroup_from_task(current);
+}
+
+void shmem_release_vm_cgroup(struct shmem_inode_info *info)
+{
+ struct vm_cgroup *vmcg = info->vmcg;
+
+ if (vmcg)
+ css_put(&vmcg->css);
+}
+
+int vm_cgroup_charge_shmem(struct shmem_inode_info *info,
+ unsigned long nr_pages)
+{
+ struct vm_cgroup *vmcg = info->vmcg;
+
+ if (vmcg)
+ return vm_cgroup_do_charge(vmcg, nr_pages);
+ return 0;
+}
+
+void vm_cgroup_uncharge_shmem(struct shmem_inode_info *info,
+ unsigned long nr_pages)
+{
+ struct vm_cgroup *vmcg = info->vmcg;
+
+ if (vmcg)
+ vm_cgroup_do_uncharge(vmcg, nr_pages);
+}
+#endif /* CONFIG_SHMEM */
+
static struct cgroup_subsys_state *
vm_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
--
1.7.10.4

2014-07-03 12:49:50

by Vladimir Davydov

Subject: [PATCH RFC 5/5] vm_cgroup: do not charge tasks in root cgroup

For the root cgroup (the whole system), we already have overcommit
accounting and control, so we can skip charging tasks in the root cgroup
to avoid overhead.

Signed-off-by: Vladimir Davydov <[email protected]>
---
mm/vm_cgroup.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/vm_cgroup.c b/mm/vm_cgroup.c
index 6642f934540a..c871fecaab4c 100644
--- a/mm/vm_cgroup.c
+++ b/mm/vm_cgroup.c
@@ -1,6 +1,7 @@
#include <linux/cgroup.h>
#include <linux/res_counter.h>
#include <linux/mm.h>
+#include <linux/mman.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>
#include <linux/shmem_fs.h>
@@ -65,6 +66,9 @@ static int vm_cgroup_do_charge(struct vm_cgroup *vmcg,
unsigned long val = nr_pages << PAGE_SHIFT;
struct res_counter *fail_res;

+ if (vm_cgroup_is_root(vmcg))
+ return 0;
+
return res_counter_charge(&vmcg->res, val, &fail_res);
}

@@ -73,6 +77,9 @@ static void vm_cgroup_do_uncharge(struct vm_cgroup *vmcg,
{
unsigned long val = nr_pages << PAGE_SHIFT;

+ if (vm_cgroup_is_root(vmcg))
+ return;
+
res_counter_uncharge(&vmcg->res, val);
}

@@ -159,6 +166,9 @@ static u64 vm_cgroup_read_u64(struct cgroup_subsys_state *css,
struct vm_cgroup *vmcg = vm_cgroup_from_css(css);
int memb = cft->private;

+ if (vm_cgroup_is_root(vmcg))
+ return vm_memory_committed() << PAGE_SHIFT;
+
return res_counter_read_u64(&vmcg->res, memb);
}

--
1.7.10.4

2014-07-04 12:16:25

by Michal Hocko

Subject: Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

On Thu 03-07-14 16:48:16, Vladimir Davydov wrote:
> Hi,
>
> Typically, when a process calls mmap, it isn't given all the memory pages it
> requested immediately. Instead, only its address space is grown, while the
> memory pages will be actually allocated on the first use. If the system fails
> to allocate a page, it will have no choice except invoking the OOM killer,
> which may kill this or any other process. Obviously, it isn't the best way of
> telling the user that the system is unable to handle his request. It would be
> much better to fail mmap with ENOMEM instead.
>
> That's why Linux has the memory overcommit control feature, which accounts and
> limits VM size that may contribute to mem+swap, i.e. private writable mappings
> and shared memory areas. However, currently it's only available system-wide,
> and there's no way of avoiding OOM in cgroups.
>
> This patch set is an attempt to fill the gap. It implements the resource
> controller for cgroups that accounts and limits address space allocations that
> may contribute to mem+swap.

Well, I am not really sure how helpful this is. Could you be more
specific about real use cases? If the only problem is that memcg OOM can
trigger too easily then I do not think this is the right approach to
handle it. Strict no-overcommit is basically unusable for many
workloads. Especially those which try to do their own memory usage
optimization in a much larger address space.

Once I get from internal things (which will happen soon hopefully) I
will post a series with a new set of memcg limits. One of them is
high_limit which can be used as a trigger for memcg reclaim. Unlike
hard_limit there won't be any OOM if the reclaim fails at this stage. So
if the high_limit is configured properly the admin will have enough time
to make additional steps before OOM happens.
[...]
--
Michal Hocko
SUSE Labs

2014-07-04 15:39:17

by Vladimir Davydov

Subject: Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

Hi Michal,

On Fri, Jul 04, 2014 at 02:16:21PM +0200, Michal Hocko wrote:
> On Thu 03-07-14 16:48:16, Vladimir Davydov wrote:
> > Hi,
> >
> > Typically, when a process calls mmap, it isn't given all the memory pages it
> > requested immediately. Instead, only its address space is grown, while the
> > memory pages will be actually allocated on the first use. If the system fails
> > to allocate a page, it will have no choice except invoking the OOM killer,
> > which may kill this or any other process. Obviously, it isn't the best way of
> > telling the user that the system is unable to handle his request. It would be
> > much better to fail mmap with ENOMEM instead.
> >
> > That's why Linux has the memory overcommit control feature, which accounts and
> > limits VM size that may contribute to mem+swap, i.e. private writable mappings
> > and shared memory areas. However, currently it's only available system-wide,
> > and there's no way of avoiding OOM in cgroups.
> >
> > This patch set is an attempt to fill the gap. It implements the resource
> > controller for cgroups that accounts and limits address space allocations that
> > may contribute to mem+swap.
>
> Well, I am not really sure how helpful this is. Could you be more
> specific about real use cases? If the only problem is that memcg OOM can
> trigger too easily then I do not think this is the right approach to
> handle it.

The problem is that an application inside a container is currently given
no hints on how much memory it may actually consume. It can mmap a huge
area and eventually find itself killed or swapped out after using
several percent of it. This can be painful sometimes. Let me give an
example.

Suppose a user wants to run some computational workload, which may take
several days. He doesn't exactly know how much memory it will consume,
so he decides to start with buying a 1G container for it. He then starts
the workload in the container and sees it's working fine for some time.
So he decides he guessed the container size right and now only has to
wait for a day or two. Suppose the workload actually wants 10G. Or it
can consume up to 100G and has some weird logic to determine how much
memory the system may give it, e.g. trying to mmap as much as possible.
Suppose the server the container is running on has 1000G. The workload
won't fail immediately then. It will be allowed to consume the whole 1G,
which may take quite a while, but in the end it will either fail with OOM or
become really sluggish due to swapping. The user will probably be frustrated
to find that his workload has failed when he comes back in a day or two,
because it will cost him money and time. This wouldn't happen if there were a
VM limit, which would stop the application right at the start, giving the
user a hint that something is going wrong and that he needs to either tune
his application (e.g. setting -Xmsn for java) or buy a bigger container.

You can argue that the container may have a kind of meminfo
virtualization and any sane application must go and check it, but (1)
not all applications do that (some may try mmap-until-failure
heuristic), (2) there may be several unrelated processes inside the CT, each
checking that there is plenty of free memory according to meminfo, mmapping
it and failing later, (3) it may be an application container, which
doesn't have proc mounted.

I guess that's why most distributions have overcommit limited by default
(vm.overcommit_memory!=2).

> Strict no-overcommit is basically unusable for many workloads.
> Especially those which try to do their own memory usage optimization
> in a much larger address space.

Sure, 'no-overcommit' is definitely unusable, but we can set it to e.g.
twice the memcg limit. This will allow overcommitting memory to some extent,
but fail really large allocations that can never be served.
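
For instance, the container management tool could do something along these
lines (just a sketch; the cgroupfs paths, the memsw file and the factor of
two are assumptions for the example, nothing here is mandated by the patch
set):

/*
 * Illustration only: derive the container's vm limit from its memcg
 * (mem+swap) limit, allowing up to 2x overcommit of address space.
 */
#include <stdio.h>

int main(void)
{
        unsigned long long memsw_limit;
        FILE *f;

        f = fopen("/sys/fs/cgroup/memory/CT1/memory.memsw.limit_in_bytes", "r");
        if (!f)
                return 1;
        if (fscanf(f, "%llu", &memsw_limit) != 1) {
                fclose(f);
                return 1;
        }
        fclose(f);

        f = fopen("/sys/fs/cgroup/vm/CT1/vm.limit_in_bytes", "w");
        if (!f)
                return 1;
        fprintf(f, "%llu", 2 * memsw_limit);    /* allow 2x overcommit */
        return fclose(f) ? 1 : 0;
}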

> Once I get from internal things (which will happen soon hopefully) I
> will post a series with a new set of memcg limits. One of them is
> high_limit which can be used as a trigger for memcg reclaim. Unlike
> hard_limit there won't be any OOM if the reclaim fails at this stage. So
> if the high_limit is configured properly the admin will have enough time
> to make additional steps before OOM happens.

High/low limits that start reclaim on internal/external pressure are
definitely a very nice feature (maybe even more useful than strict
limits). However, they won't help us against overcommit inside a
container. AFAIC,

- low limit will allow the container to consume as much as it wants
until it triggers global memory pressure, then it will be shrunk back
to its limit aggressively;

- high limit means allow to breach the limit, but trigger reclaim
asynchronously (a kind of kswapd) or synchronously when it happens.

Right?

Considering the example I've given above, both of these won't help if
the system has other active CTs: the container will be forcefully kept
around its high/low limit and, since it's definitely not enough for it,
it will be finally killed crossing out the computations it's spent so
much time on. High limit won't be good for the container even if there's
no other load on the node - it will be constantly swapping out anon
memory and evicting file caches. The application won't die quickly then,
but it will get a heavy slowdown, which is no better than killing I
guess.

Also, I guess it'd be beneficial to have

- mlocked pages accounting per cgroup, because they affect memory
reclaim, and how low/high limits work, so it'd be nice to have them
limited to a sane value;

- shmem areas accounting per cgroup, because the total amount of shmem
on the system is limited, and it'll be no good if malicious
containers eat it all.

IMO it wouldn't be a good idea to overwhelm memcg with those limits; the
VM controller suits this much better.

Thanks.

2014-07-09 07:53:11

by Vladimir Davydov

Subject: Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

On Thu, Jul 03, 2014 at 04:48:16PM +0400, Vladimir Davydov wrote:
> Hi,
>
> Typically, when a process calls mmap, it isn't given all the memory pages it
> requested immediately. Instead, only its address space is grown, while the
> memory pages will be actually allocated on the first use. If the system fails
> to allocate a page, it will have no choice except invoking the OOM killer,
> which may kill this or any other process. Obviously, it isn't the best way of
> telling the user that the system is unable to handle his request. It would be
> much better to fail mmap with ENOMEM instead.
>
> That's why Linux has the memory overcommit control feature, which accounts and
> limits VM size that may contribute to mem+swap, i.e. private writable mappings
> and shared memory areas. However, currently it's only available system-wide,
> and there's no way of avoiding OOM in cgroups.
>
> This patch set is an attempt to fill the gap. It implements the resource
> controller for cgroups that accounts and limits address space allocations that
> may contribute to mem+swap.
>
> The interface is similar to the one of the memory cgroup except it controls
> virtual memory usage, not actual memory allocation:
>
> vm.usage_in_bytes current vm usage of processes inside cgroup
> (read-only)
>
> vm.max_usage_in_bytes max vm.usage_in_bytes, can be reset by writing 0
>
> vm.limit_in_bytes vm.usage_in_bytes must be <= vm.limit_in_bytes;
> allocations that hit the limit will be failed
> with ENOMEM
>
> vm.failcnt number of times the limit was hit, can be reset
> by writing 0
>
> In future, the controller can be easily extended to account for locked pages
> and shmem.

Any thoughts on this?

Thanks.

2014-07-09 15:09:52

by Tim Hockin

Subject: Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

How is this different from RLIMIT_AS? You specifically mentioned it
earlier but you don't explain how this is different.

From my perspective, this is pointless. There's plenty of perfectly
correct software that mmaps files without concern for VSIZE, because
they never fault most of those pages in. From my observations it is
not generally possible to predict an average VSIZE limit that would
satisfy your concerns *and* not kill lots of valid apps.

It sounds like what you want is to limit or even disable swap usage.
Given your example, your hypothetical user would probably be better off
getting an OOM kill early so she can fix her job spec to request more
memory.

On Wed, Jul 9, 2014 at 12:52 AM, Vladimir Davydov
<[email protected]> wrote:
> On Thu, Jul 03, 2014 at 04:48:16PM +0400, Vladimir Davydov wrote:
>> Hi,
>>
>> Typically, when a process calls mmap, it isn't given all the memory pages it
>> requested immediately. Instead, only its address space is grown, while the
>> memory pages will be actually allocated on the first use. If the system fails
>> to allocate a page, it will have no choice except invoking the OOM killer,
>> which may kill this or any other process. Obviously, it isn't the best way of
>> telling the user that the system is unable to handle his request. It would be
>> much better to fail mmap with ENOMEM instead.
>>
>> That's why Linux has the memory overcommit control feature, which accounts and
>> limits VM size that may contribute to mem+swap, i.e. private writable mappings
>> and shared memory areas. However, currently it's only available system-wide,
>> and there's no way of avoiding OOM in cgroups.
>>
>> This patch set is an attempt to fill the gap. It implements the resource
>> controller for cgroups that accounts and limits address space allocations that
>> may contribute to mem+swap.
>>
>> The interface is similar to the one of the memory cgroup except it controls
>> virtual memory usage, not actual memory allocation:
>>
>> vm.usage_in_bytes current vm usage of processes inside cgroup
>> (read-only)
>>
>> vm.max_usage_in_bytes max vm.usage_in_bytes, can be reset by writing 0
>>
>> vm.limit_in_bytes vm.usage_in_bytes must be <= vm.limit_in_bytes;
>> allocations that hit the limit will be failed
>> with ENOMEM
>>
>> vm.failcnt number of times the limit was hit, can be reset
>> by writing 0
>>
>> In future, the controller can be easily extended to account for locked pages
>> and shmem.
>
> Any thoughts on this?
>
> Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2014-07-09 16:36:56

by Vladimir Davydov

Subject: Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

Hi Tim,

On Wed, Jul 09, 2014 at 08:08:07AM -0700, Tim Hockin wrote:
> How is this different from RLIMIT_AS? You specifically mentioned it
> earlier but you don't explain how this is different.

The main difference is that RLIMIT_AS is per process while this
controller is per cgroup. RLIMIT_AS doesn't allow us to limit VSIZE for
a group of unrelated processes or processes cooperating through shmem.

Also RLIMIT_AS accounts for total VM usage (including file mappings),
while this only charges private writable and shared mappings, whose
faulted-in pages always occupy mem+swap and therefore cannot be just
synced and dropped like file pages. In other words, this controller
works exactly as the global overcommit control.
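
To make the contrast concrete, here is a small illustrative snippet (all
values made up) of how RLIMIT_AS behaves today:

/*
 * RLIMIT_AS is evaluated per process: the value is inherited across
 * fork(), but each process gets its own address-space budget, and every
 * mapping counts toward it, whether or not the pages would ever be
 * faulted in or end up in mem+swap.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl = {
                .rlim_cur = 1UL << 30,  /* 1G address-space cap */
                .rlim_max = 1UL << 30,
        };
        void *p;

        if (setrlimit(RLIMIT_AS, &rl))
                perror("setrlimit");

        /*
         * Fails with ENOMEM because VSIZE would exceed 1G, even though a
         * read-only anonymous mapping like this would typically never be
         * charged to mem+swap.
         */
        p = mmap(NULL, 2UL << 30, PROT_READ,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                perror("mmap");
        return 0;
}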

> From my perspective, this is pointless. There's plenty of perfectly
> correct software that mmaps files without concern for VSIZE, because
> they never fault most of those pages in.

But there's also software that correctly handles ENOMEM returned by
mmap. For example, mongodb keeps growing its buffers until mmap fails.
Therefore, if there's no overcommit control, it will be OOM-killed
sooner or later, which may be pretty annoying. And we did have customers
complaining about that.
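
The pattern looks roughly like this (a simplified sketch, not mongodb's
actual code):

/*
 * Grow buffers until mmap() fails.  Without any overcommit limit the
 * loop below can reserve far more address space than the container can
 * ever back with mem+swap, and the process risks being OOM-killed much
 * later, when the pages are finally touched.  With a VM limit it stops
 * cleanly at ENOMEM right here.
 */
#include <stdio.h>
#include <sys/mman.h>

#define CHUNK   (64UL << 20)    /* grow in 64M steps */

int main(void)
{
        unsigned long total = 0;

        for (;;) {
                void *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");         /* fall back gracefully */
                        break;
                }
                total += CHUNK;
        }
        printf("reserved %lu bytes of address space\n", total);
        return 0;
}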

> From my observations it is not generally possible to predict an
> average VSIZE limit that would satisfy your concerns *and* not kill
> lots of valid apps.

Yes, it's difficult. Actually, we can only guess. Nevertheless, we
predict and set the VSIZE limit system-wide by default.

> It sounds like what you want is to limit or even disable swap usage.

I want to avoid OOM kill if it's possible to return ENOMEM. OOM can be
painful. It can kill lots of innocent processes. Of course, the user can
protect some processes by setting oom_score_adj, but this is difficult
and requires time and expertise, so an average user won't do that.

> Given your example, your hypothetical user would probably be better off
> getting an OOM kill early so she can fix her job spec to request more
> memory.

In my example the user won't get OOM kill *early*...

Thanks.

2014-07-09 17:04:45

by Greg Thelen

Subject: Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

On Wed, Jul 9, 2014 at 9:36 AM, Vladimir Davydov <[email protected]> wrote:
> Hi Tim,
>
> On Wed, Jul 09, 2014 at 08:08:07AM -0700, Tim Hockin wrote:
>> How is this different from RLIMIT_AS? You specifically mentioned it
>> earlier but you don't explain how this is different.
>
> The main difference is that RLIMIT_AS is per process while this
> controller is per cgroup. RLIMIT_AS doesn't allow us to limit VSIZE for
> a group of unrelated processes or processes cooperating through shmem.
>
> Also RLIMIT_AS accounts for total VM usage (including file mappings),
> while this only charges private writable and shared mappings, whose
> faulted-in pages always occupy mem+swap and therefore cannot be just
> synced and dropped like file pages. In other words, this controller
> works exactly as the global overcommit control.
>
>> From my perspective, this is pointless. There's plenty of perfectly
>> correct software that mmaps files without concern for VSIZE, because
>> they never fault most of those pages in.
>
> But there's also software that correctly handles ENOMEM returned by
> mmap. For example, mongodb keeps growing its buffers until mmap fails.
> Therefore, if there's no overcommit control, it will be OOM-killed
> sooner or later, which may be pretty annoying. And we did have customers
> complaining about that.

Is mongodb's buffer growth causing the oom kills?

If yes, I wonder if apps, like mongodb, that want ENOMEM should (1)
use MAP_POPULATE and (2) we change vm_mmap_pgoff() to propagate
mm_populate() ENOMEM failures back to mmap()?
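
Roughly along these lines for (1) (just a sketch; today a population failure
is silently ignored and mmap() still returns the mapping, which is what (2)
would change):

/*
 * Ask the kernel to fault the pages in at mmap() time.  If mmap()
 * itself fails with ENOMEM the application can back off instead of
 * being OOM-killed later.
 */
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

static void *alloc_buffer(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

        if (p == MAP_FAILED) {
                if (errno == ENOMEM)
                        fprintf(stderr, "no memory for %zu bytes\n", len);
                return NULL;    /* caller can retry with a smaller buffer */
        }
        return p;
}

int main(void)
{
        return alloc_buffer(256UL << 20) ? 0 : 1;       /* try a 256M buffer */
}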

>> From my observations it is not generally possible to predict an
>> average VSIZE limit that would satisfy your concerns *and* not kill
>> lots of valid apps.
>
> Yes, it's difficult. Actually, we can only guess. Nevertheless, we
> predict and set the VSIZE limit system-wide by default.
>
>> It sounds like what you want is to limit or even disable swap usage.
>
> I want to avoid OOM kill if it's possible to return ENOMEM. OOM can be
> painful. It can kill lots of innocent processes. Of course, the user can
> protect some processes by setting oom_score_adj, but this is difficult
> and requires time and expertise, so an average user won't do that.
>
>> Given your example, your hypothetical user would probably be better off
>> getting an OOM kill early so she can fix her job spec to request more
>> memory.
>
> In my example the user won't get OOM kill *early*...

2014-07-10 16:36:10

by Vladimir Davydov

Subject: Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

Hi Greg,

On Wed, Jul 09, 2014 at 10:04:21AM -0700, Greg Thelen wrote:
> On Wed, Jul 9, 2014 at 9:36 AM, Vladimir Davydov <[email protected]> wrote:
> > Hi Tim,
> >
> > On Wed, Jul 09, 2014 at 08:08:07AM -0700, Tim Hockin wrote:
> >> How is this different from RLIMIT_AS? You specifically mentioned it
> >> earlier but you don't explain how this is different.
> >
> > The main difference is that RLIMIT_AS is per process while this
> > controller is per cgroup. RLIMIT_AS doesn't allow us to limit VSIZE for
> > a group of unrelated processes or processes cooperating through shmem.
> >
> > Also RLIMIT_AS accounts for total VM usage (including file mappings),
> > while this only charges private writable and shared mappings, whose
> > faulted-in pages always occupy mem+swap and therefore cannot be just
> > synced and dropped like file pages. In other words, this controller
> > works exactly as the global overcommit control.
> >
> >> From my perspective, this is pointless. There's plenty of perfectly
> >> correct software that mmaps files without concern for VSIZE, because
> >> they never fault most of those pages in.
> >
> > But there's also software that correctly handles ENOMEM returned by
> > mmap. For example, mongodb keeps growing its buffers until mmap fails.
> > Therefore, if there's no overcommit control, it will be OOM-killed
> > sooner or later, which may be pretty annoying. And we did have customers
> > complaining about that.
>
> Is mongodb's buffer growth causing the oom kills?

We saw this happen on our customer's node some time ago. A container
running mongodb and several other services got OOM-kills from time to
time, which made the customer unhappy. Limiting overcommit helped then.

> If yes, I wonder if apps, like mongodb, that want ENOMEM should (1)
> use MAP_POPULATE and (2) we change vm_mmap_pgoff() to propagate
> mm_populate() ENOMEM failures back to mmap()?

This way we may fault in lots of pages, evicting someone's working set
along the way, only to get ENOMEM eventually. That doesn't look optimal.
Also, this requires modifying userspace apps, which isn't always
possible.

Thanks.

2014-07-16 12:01:57

by Michal Hocko

Subject: Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

On Fri 04-07-14 19:38:53, Vladimir Davydov wrote:
> Hi Michal,
>
> On Fri, Jul 04, 2014 at 02:16:21PM +0200, Michal Hocko wrote:
[...]
> > Once I get from internal things (which will happen soon hopefully) I
> > will post a series with a new set of memcg limits. One of them is
> > high_limit which can be used as a trigger for memcg reclaim. Unlike
> > hard_limit there won't be any OOM if the reclaim fails at this stage. So
> > if the high_limit is configured properly the admin will have enough time
> > to make additional steps before OOM happens.
>
> High/low limits that start reclaim on internal/external pressure are
> definitely a very nice feature (maybe even more useful than strict
> limits). However, they won't help us against overcommit inside a
> container. AFAIC,
>
> - low limit will allow the container to consume as much as it wants
> until it triggers global memory pressure, then it will be shrunk back
> to its limit aggressively;

No, this is not like soft_limit. Any external pressure (e.g. coming from
some of the parents) will exclude memcgs which are below their low_limit.
If there is no way to proceed because all groups in the currently
reclaimed hierarchy are below their low limit then it will ignore the low
limit. So this is an optimistic working set protection.

> - high limit means allow to breach the limit, but trigger reclaim
> asynchronously (a kind of kswapd) or synchronously when it happens.

No, we will start with the direct reclaim as we do for the hard limit.
The only change wrt. hard limit is that we do not trigger OOM if the
reclaim fails.

> Right?
>
> Considering the example I've given above, both of these won't help if
> the system has other active CTs: the container will be forcefully kept
> around its high/low limit and, since it's definitely not enough for it,
> it will be finally killed crossing out the computations it's spent so
> much time on. High limit won't be good for the container even if there's
> no other load on the node - it will be constantly swapping out anon
> memory and evicting file caches. The application won't die quickly then,
> but it will get a heavy slowdown, which is no better than killing I
> guess.

It will get vmpressure notifications though and can help to release
excessive buffers which were allocated optimistically.

> Also, I guess it'd be beneficial to have
>
> - mlocked pages accounting per cgroup, because they affect memory
> reclaim, and how low/high limits work, so it'd be nice to have them
> limited to a sane value;
>
> - shmem areas accounting per cgroup, because the total amount of shmem
> on the system is limited, and it'll be no good if malicious
> containers eat it all.
>
> IMO It wouldn't be a good idea to overwhelm memcg with those limits, the
> VM controller suits much better.

yeah, I do not think adding more to memcg is a good idea. I am still not
sure whether working around bad design decisions in applications is a
good rationale for a new controller.

--
Michal Hocko
SUSE Labs

2014-07-23 14:08:59

by Vladimir Davydov

Subject: Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

On Wed, Jul 16, 2014 at 02:01:47PM +0200, Michal Hocko wrote:
> On Fri 04-07-14 19:38:53, Vladimir Davydov wrote:
> > Considering the example I've given above, both of these won't help if
> > the system has other active CTs: the container will be forcefully kept
> > around its high/low limit and, since it's definitely not enough for it,
> > it will be finally killed crossing out the computations it's spent so
> > much time on. High limit won't be good for the container even if there's
> > no other load on the node - it will be constantly swapping out anon
> > memory and evicting file caches. The application won't die quickly then,
> > but it will get a heavy slowdown, which is no better than killing I
> > guess.
>
> It will get vmpressure notifications though and can help to release
> excessive buffers which were allocated optimistically.

But the user will only get the notification *after* his application has
touched the memory within the limit, which may take quite a long time.

> > Also, I guess it'd be beneficial to have
> >
> > - mlocked pages accounting per cgroup, because they affect memory
> > reclaim, and how low/high limits work, so it'd be nice to have them
> > limited to a sane value;
> >
> > - shmem areas accounting per cgroup, because the total amount of shmem
> > on the system is limited, and it'll be no good if malicious
> > containers eat it all.
> >
> > IMO It wouldn't be a good idea to overwhelm memcg with those limits, the
> > VM controller suits much better.
>
> yeah, I do not think adding more to memcg is a good idea. I am still not
> sure whether working around bad design decisions in applications is a
> good rationale for a new controller.

Where do you see "bad design decision" in the example I've given above?
To recap, the user doesn't know how much memory his application is going
to consume and he wants to be notified about a potential failure as soon
as possible instead of waiting until it touches all the memory within
the container limit.

Also, what's wrong if an application wants to eat a lot of shared
memory, which is a limited resource? Suppose the user sets memsw.limit
for his container to half of RAM hoping it's isolated and won't cause
any trouble, but eventually he finds other workloads failing on the
host because the processes inside it have eaten all available shmem.

Thanks.