2006-11-09 19:35:54

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 0/8] RSS controller for containers

Here is set of patches that implements a *simple minded* RSS controller
for containers. It would be nice to split up the memory controller design
and implementation in phases

1. RSS control
2. Page Cache control (with split clean and dirty accounting/control)
3. mlock() control
4. Kernel accounting and control

The beancounter implementation follows a very similar approach. The split
up makes the design of the controller easier. RSS for example, can be tracked
per mm_struct. Page Cache could be tracked per inode, per thread
or per mm_struct (depending on what form is most suitable).

The definition of RSS was debated on lkml, please see

http://lkml.org/lkml/2006/10/10/130

This patchset is a proof of concept implementation and the accounting can
be easily adapted to meet the definition of RSS as and when it is re-defined
or revisited. The changes required should be small.

The reclamation logic has been borrowed from Dave Hansen's challenged
memory controller and from shrink_all_memory(). The accounting was inspired
from Rohit Seth's container patches.

The good
--------

No additional pointers required in struct page.
There is also a lot of scope for code reuse in tracking the rss of a process
(this reuse is yet to be exploited).

The not so good
---------------
The patches contain a lot of debugging code.

Applying the patches
--------------------
This patchset has been developed on top of 2.6.19-rc2 with the latest
containers patch applied.

To run and test this patch, additional fixes are required.

Please see

http://lkml.org/lkml/2006/11/6/10
http://lkml.org/lkml/2006/11/6/245


Series
------
container-res-groups-fix-parsing.patch
container-memctlr-setup.patch
container-memctlr-callbacks.patch
container-memctlr-acct.patch
container-memctlr-task-migration.patch
container-memctlr-shares.patch
container-memctlr-reclaim.patch

Setup
-----
To test the series, here's what you need to do

0. Get the latest containers patches against 2.6.19-rc2
1. Apply all the fixes
2. Apply these patches
3. Build the kernel and mount the container filesystem
mount -t container container /container

4. Disable cpuset's (to simply assignment of tasks to resource groups)

cd /container
echo 0 > cpuset_enabled

5. Add the current task to a new group

mkdir /container/a
echo $$ > tasks
cat memctlr_stats

6. Set limits

echo "res=memctlr,max_shares=10" > memctlr_shares

7. Spin the system, hang it, revolve it, crash it!!
8. Please provide feedback, both code review and any thing else that
can be useful for further development

Testing
-------
Kernbench was run on these patches and it did not show any significant
overhead in the tests.


--

Balbir Singh,
Linux Technology Center,
IBM Software Labs


2006-11-09 19:36:17

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 2/8] RSS controller setup



Basic setup for a controller written for resource groups. This patch
registers a dummy controller.


Signed-off-by: Balbir Singh <[email protected]>
---

include/linux/memctlr.h | 31 ++++++++++++++
init/Kconfig | 11 +++++
kernel/res_group/Makefile | 1
kernel/res_group/memctlr.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 137 insertions(+)

diff -puN /dev/null include/linux/memctlr.h
--- /dev/null 2006-05-31 06:45:07.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/memctlr.h 2006-11-09 23:56:03.000000000 +0530
@@ -0,0 +1,31 @@
+/*
+ * Memory controller - "Account and Control Memory Usage"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * Author: Balbir Singh <[email protected]>
+ *
+ */
+
+#ifndef _LINUX_MEMCTRL_H
+#define _LINUX_MEMCTRL_H
+
+#ifdef CONFIG_RES_GROUPS_MEMORY
+#include <linux/res_group_rc.h>
+#endif /* CONFIG_RES_GROUPS_MEMORY */
+
+#endif /* _LINUX_MEMCTRL_H */
diff -puN init/Kconfig~container-memctlr-setup init/Kconfig
--- linux-2.6.19-rc2/init/Kconfig~container-memctlr-setup 2006-11-09 23:09:03.000000000 +0530
+++ linux-2.6.19-rc2-balbir/init/Kconfig 2006-11-09 23:56:47.000000000 +0530
@@ -325,6 +325,17 @@ config RES_GROUPS_NUMTASKS

Say N if unsure, Y to use the feature.

+config RES_GROUPS_MEMORY
+ bool "Memory Controller for RSS"
+ depends on RES_GROUPS
+ default y
+ help
+ Provides a Resource Controller for Resource Groups.
+ It limits the resident pages of the tasks belonging to the resource
+ group.
+
+ Say N if unsure, Y to use the feature.
+
endmenu
config SYSCTL
bool
diff -puN kernel/res_group/Makefile~container-memctlr-setup kernel/res_group/Makefile
--- linux-2.6.19-rc2/kernel/res_group/Makefile~container-memctlr-setup 2006-11-09 23:09:03.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/Makefile 2006-11-09 23:09:03.000000000 +0530
@@ -1,2 +1,3 @@
obj-y = res_group.o shares.o rgcs.o
obj-$(CONFIG_RES_GROUPS_NUMTASKS) += numtasks.o
+obj-$(CONFIG_RES_GROUPS_MEMORY) += memctlr.o
diff -puN /dev/null kernel/res_group/memctlr.c
--- /dev/null 2006-05-31 06:45:07.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c 2006-11-09 23:56:03.000000000 +0530
@@ -0,0 +1,94 @@
+/*
+ * Memory controller - "Account and Control Memory Usage"
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * Author: Balbir Singh <[email protected]>
+ *
+ */
+
+/*
+ * Simple memory controller.
+ * Supports limits, guarantees not supported right now
+ *
+ * Tasks are group'ed virtually by thread groups - Add more details
+ */
+
+#include <linux/module.h>
+#include <linux/res_group_rc.h>
+#include <linux/memctlr.h>
+
+static const char res_ctlr_name[] = "memctlr";
+static struct resource_group *root_rgroup;
+
+struct mem_counter {
+ atomic_long_t rss;
+};
+
+struct memctlr {
+ struct resource_group *rgroup; /* My resource group */
+ struct res_shares shares; /* My shares */
+
+ struct mem_counter counter; /* Accounting information */
+ /* Statistics */
+ int successes;
+ int failures;
+};
+
+struct res_controller memctlr_rg;
+
+static struct memctlr *get_memctlr_from_shares(struct res_shares *shares)
+{
+ if (shares)
+ return container_of(shares, struct memctlr, shares);
+ return NULL;
+}
+
+static struct memctlr *get_memctlr(struct resource_group *rgroup)
+{
+ return get_memctlr_from_shares(get_controller_shares(rgroup,
+ &memctlr_rg));
+}
+
+struct res_controller memctlr_rg = {
+ .name = res_ctlr_name,
+ .ctlr_id = NO_RES_ID,
+ .alloc_shares_struct = NULL,
+ .free_shares_struct = NULL,
+ .move_task = NULL,
+ .shares_changed = NULL,
+ .show_stats = NULL,
+};
+
+int __init memctlr_init(void)
+{
+ if (memctlr_rg.ctlr_id != NO_RES_ID)
+ return -EBUSY; /* already registered */
+ return register_controller(&memctlr_rg);
+}
+
+void __exit memctlr_exit(void)
+{
+ int rc;
+ do {
+ rc = unregister_controller(&memctlr_rg);
+ } while (rc == -EBUSY);
+ BUG_ON(rc != 0);
+}
+
+module_init(memctlr_init);
+module_exit(memctlr_exit);
_

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-09 19:36:48

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 5/8] RSS controller task migration support



Support migration of tasks across groups. Migration uses the accounting
information tracked in the mm_struct to add/delete RSS from the container as
a process migrates from one container to the next.

This patch also adds a /proc/<tid>/memacct interface for debugging purposes.
/proc/<tid>/memacct prints the rss of the task

1. As accounted by the patches
2. By walking the page tables of the process


Signed-off-by: Balbir Singh <[email protected]>
---

fs/proc/base.c | 4
include/linux/memctlr.h | 9 +
include/linux/rmap.h | 6 -
kernel/res_group/memctlr.c | 228 ++++++++++++++++++++++++++++++++++++++++++---
mm/filemap_xip.c | 2
mm/fremap.c | 2
mm/memory.c | 6 -
mm/rmap.c | 6 -
8 files changed, 236 insertions(+), 27 deletions(-)

diff -puN kernel/res_group/memctlr.c~container-memctlr-task-migration kernel/res_group/memctlr.c
--- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-task-migration 2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c 2006-11-09 21:56:49.000000000 +0530
@@ -31,10 +31,12 @@
#include <linux/module.h>
#include <linux/res_group_rc.h>
#include <linux/memctlr.h>
+#include <linux/mm.h>
+#include <asm/pgtable.h>

static const char res_ctlr_name[] = "memctlr";
static struct resource_group *root_rgroup;
-static const char version[] = "0.01";
+static const char version[] = "0.05";
static struct memctlr *memctlr_root;

#define MEMCTLR_MAGIC 0xdededede
@@ -52,6 +54,7 @@ struct memctlr {
int successes;
int failures;
int magic;
+ spinlock_t lock;
};

struct res_controller memctlr_rg;
@@ -95,7 +98,7 @@ void mm_assign_container(struct mm_struc
rcu_read_unlock();
}

-static inline struct memctlr *get_memctlr_from_page(struct page *page)
+static inline struct memctlr *get_task_memctlr(struct task_struct *p)
{
struct resource_group *rgroup;
struct memctlr *res;
@@ -107,7 +110,7 @@ static inline struct memctlr *get_memctl
return NULL;

rcu_read_lock();
- rgroup = (struct resource_group *)rcu_dereference(current->container);
+ rgroup = (struct resource_group *)rcu_dereference(p->container);
rcu_read_unlock();

res = get_memctlr(rgroup);
@@ -119,31 +122,54 @@ static inline struct memctlr *get_memctl
}


-void memctlr_inc_rss(struct page *page)
+void memctlr_inc_rss_mm(struct page *page, struct mm_struct *mm)
{
struct memctlr *res;

- res = get_memctlr_from_page(page);
- if (!res)
+ res = get_task_memctlr(current);
+ if (!res) {
+ printk(KERN_INFO "inc_rss no res set *---*\n");
return;
+ }

- atomic_long_inc(&current->mm->counter->rss);
+ spin_lock(&res->lock);
+ atomic_long_inc(&mm->counter->rss);
atomic_long_inc(&res->counter.rss);
+ spin_unlock(&res->lock);
}

-void memctlr_dec_rss(struct page *page)
+void memctlr_inc_rss(struct page *page)
{
struct memctlr *res;
+ struct mm_struct *mm = get_task_mm(current);

- res = get_memctlr_from_page(page);
- if (!res)
+ res = get_task_memctlr(current);
+ if (!res) {
+ printk(KERN_INFO "inc_rss no res set *---*\n");
return;
+ }

- atomic_long_dec(&res->counter.rss);
+ spin_lock(&res->lock);
+ atomic_long_inc(&mm->counter->rss);
+ atomic_long_inc(&res->counter.rss);
+ spin_unlock(&res->lock);
+ mmput(mm);
+}

- if ((current->flags & PF_EXITING) && !current->mm)
+void memctlr_dec_rss(struct page *page, struct mm_struct *mm)
+{
+ struct memctlr *res;
+
+ res = get_task_memctlr(current);
+ if (!res) {
+ printk(KERN_INFO "dec_rss no res set *---*\n");
return;
- atomic_long_dec(&current->mm->counter->rss);
+ }
+
+ spin_lock(&res->lock);
+ atomic_long_dec(&res->counter.rss);
+ atomic_long_dec(&mm->counter->rss);
+ spin_unlock(&res->lock);
}

static void memctlr_init_new(struct memctlr *res)
@@ -154,6 +180,7 @@ static void memctlr_init_new(struct memc
res->shares.unused_min_shares = SHARE_DEFAULT_DIVISOR;

memctlr_init_mem_counter(&res->counter);
+ spin_lock_init(&res->lock);
}

static struct res_shares *memctlr_alloc_instance(struct resource_group *rgroup)
@@ -188,6 +215,122 @@ static void memctlr_free_instance(struct
kfree(res);
}

+static long count_pte_rss(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long addr, unsigned long end)
+{
+ pte_t *pte;
+ long count = 0;
+
+ do {
+ pte = pte_offset_map(pmd, addr);
+ if (!pte_present(*pte))
+ continue;
+ count++;
+ pte_unmap(pte);
+ } while (pte++, addr += PAGE_SIZE, (addr != end));
+
+ return count;
+}
+
+static long count_pmd_rss(struct vm_area_struct *vma, pud_t *pud,
+ unsigned long addr, unsigned long end)
+{
+ pmd_t *pmd;
+ unsigned long next;
+ long count = 0;
+
+ pmd = pmd_offset(pud, addr);
+ do {
+ next = pmd_addr_end(addr, end);
+ if (pmd_none_or_clear_bad(pmd))
+ continue;
+ count += count_pte_rss(vma, pmd, addr, next);
+ } while (pmd++, addr = next, (addr != end));
+
+ return count;
+}
+
+static long count_pud_rss(struct vm_area_struct *vma, pgd_t *pgd,
+ unsigned long addr, unsigned long end)
+{
+ pud_t *pud;
+ unsigned long next;
+ long count = 0;
+
+ pud = pud_offset(pgd, addr);
+ do {
+ next = pud_addr_end(addr, end);
+ if (pud_none_or_clear_bad(pud))
+ continue;
+ count += count_pmd_rss(vma, pud, addr, next);
+ } while (pud++, addr = next, (addr != end));
+
+ return count;
+}
+
+static long count_pgd_rss(struct vm_area_struct *vma)
+{
+ unsigned long addr, next, end;
+ pgd_t *pgd;
+ long count = 0;
+
+ addr = vma->vm_start;
+ end = vma->vm_end;
+
+ pgd = pgd_offset(vma->vm_mm, addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ if (pgd_none_or_clear_bad(pgd))
+ continue;
+ count += count_pud_rss(vma, pgd, addr, next);
+ } while (pgd++, addr = next, (addr != end));
+ return count;
+}
+
+static long count_rss(struct task_struct *p)
+{
+ int count = 0;
+ struct mm_struct *mm = get_task_mm(p);
+ struct vm_area_struct *vma = mm->mmap;
+
+ if (!mm)
+ return 0;
+
+ down_read(&mm->mmap_sem);
+ spin_lock(&mm->page_table_lock);
+
+ while (vma) {
+ count += count_pgd_rss(vma);
+ vma = vma->vm_next;
+ }
+
+ spin_unlock(&mm->page_table_lock);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ return count;
+}
+
+int proc_memacct(struct task_struct *p, char *buf)
+{
+ int i = 0, j = 0;
+ struct mm_struct *mm = get_task_mm(p);
+
+ if (!mm)
+ return sprintf(buf, "no mm associated with the task\n");
+
+ i = sprintf(buf, "rss pages %ld\n",
+ atomic_long_read(&mm->counter->rss));
+ buf += i;
+ j += i;
+
+ i = sprintf(buf, "pg table walk rss pages %ld\n", count_rss(p));
+ buf += i;
+ j += i;
+
+ mmput(mm);
+ return j;
+}
+
static ssize_t memctlr_show_stats(struct res_shares *shares, char *buf,
size_t len)
{
@@ -206,12 +349,69 @@ static ssize_t memctlr_show_stats(struct
return j;
}

+static void double_res_lock(struct memctlr *old, struct memctlr *new)
+{
+ BUG_ON(old == new);
+ if (&old->lock > &new->lock) {
+ spin_lock(&old->lock);
+ spin_lock(&new->lock);
+ } else {
+ spin_lock(&new->lock);
+ spin_lock(&old->lock);
+ }
+}
+
+static void double_res_unlock(struct memctlr *old, struct memctlr *new)
+{
+ BUG_ON(old == new);
+ if (&old->lock > &new->lock) {
+ spin_unlock(&new->lock);
+ spin_unlock(&old->lock);
+ } else {
+ spin_unlock(&old->lock);
+ spin_unlock(&new->lock);
+ }
+}
+
+static void memctlr_move_task(struct task_struct *p, struct res_shares *old,
+ struct res_shares *new)
+{
+ struct memctlr *oldres, *newres;
+ long rss_pages;
+
+ if (old == new)
+ return;
+
+ /*
+ * If a task has no mm structure associated with it we have
+ * nothing to do
+ */
+ if (!old || !new)
+ return;
+
+ if (p->pid != p->tgid)
+ return;
+
+ oldres = get_memctlr_from_shares(old);
+ newres = get_memctlr_from_shares(new);
+
+ double_res_lock(oldres, newres);
+
+ rss_pages = atomic_long_read(&p->mm->counter->rss);
+ atomic_long_sub(rss_pages, &oldres->counter.rss);
+
+ mm_assign_container(p->mm, p);
+ atomic_long_add(rss_pages, &newres->counter.rss);
+
+ double_res_unlock(oldres, newres);
+}
+
struct res_controller memctlr_rg = {
.name = res_ctlr_name,
.ctlr_id = NO_RES_ID,
.alloc_shares_struct = memctlr_alloc_instance,
.free_shares_struct = memctlr_free_instance,
- .move_task = NULL,
+ .move_task = memctlr_move_task,
.shares_changed = NULL,
.show_stats = memctlr_show_stats,
};
diff -puN fs/proc/base.c~container-memctlr-task-migration fs/proc/base.c
--- linux-2.6.19-rc2/fs/proc/base.c~container-memctlr-task-migration 2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/fs/proc/base.c 2006-11-09 21:56:49.000000000 +0530
@@ -72,6 +72,7 @@
#include <linux/audit.h>
#include <linux/poll.h>
#include <linux/nsproxy.h>
+#include <linux/memctlr.h>
#include "internal.h"

/* NOTE:
@@ -1759,6 +1760,9 @@ static struct pid_entry tgid_base_stuff[
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, numa_maps),
#endif
+#ifdef CONFIG_RES_GROUPS_MEMORY
+ INF("memacct", S_IRUGO, memacct),
+#endif
REG("mem", S_IRUSR|S_IWUSR, mem),
#ifdef CONFIG_SECCOMP
REG("seccomp", S_IRUSR|S_IWUSR, seccomp),
diff -puN include/linux/memctlr.h~container-memctlr-task-migration include/linux/memctlr.h
--- linux-2.6.19-rc2/include/linux/memctlr.h~container-memctlr-task-migration 2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/memctlr.h 2006-11-09 21:56:49.000000000 +0530
@@ -30,15 +30,20 @@
extern int mm_init_mem_counter(struct mm_struct *mm);
extern void mm_assign_container(struct mm_struct *mm, struct task_struct *p);
extern void memctlr_inc_rss(struct page *page);
-extern void memctlr_dec_rss(struct page *page);
+extern void memctlr_inc_rss_mm(struct page *page, struct mm_struct *mm);
+extern void memctlr_dec_rss(struct page *page, struct mm_struct *mm);
extern void mm_free_mem_counter(struct mm_struct *mm);
+extern int proc_memacct(struct task_struct *task, char *buffer);

#else /* CONFIG_RES_GROUPS_MEMORY */

void memctlr_inc_rss(struct page *page)
{}

-void memctlr_dec_rss(struct page *page)
+void memctlr_inc_rss_mm(struct page *page, struct mm_struct *mm)
+{}
+
+void memctlr_dec_rss(struct page *page, struct mm_struct *mm)
{}

int mm_init_mem_counter(struct mm_struct *mm)
diff -puN mm/filemap_xip.c~container-memctlr-task-migration mm/filemap_xip.c
--- linux-2.6.19-rc2/mm/filemap_xip.c~container-memctlr-task-migration 2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/filemap_xip.c 2006-11-09 21:56:49.000000000 +0530
@@ -189,7 +189,7 @@ __xip_unmap (struct address_space * mapp
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
- page_remove_rmap(page);
+ page_remove_rmap(page, mm);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
diff -puN mm/fremap.c~container-memctlr-task-migration mm/fremap.c
--- linux-2.6.19-rc2/mm/fremap.c~container-memctlr-task-migration 2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/fremap.c 2006-11-09 21:56:49.000000000 +0530
@@ -33,7 +33,7 @@ static int zap_pte(struct mm_struct *mm,
if (page) {
if (pte_dirty(pte))
set_page_dirty(page);
- page_remove_rmap(page);
+ page_remove_rmap(page, mm);
page_cache_release(page);
}
} else {
diff -puN mm/memory.c~container-memctlr-task-migration mm/memory.c
--- linux-2.6.19-rc2/mm/memory.c~container-memctlr-task-migration 2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/memory.c 2006-11-09 21:56:49.000000000 +0530
@@ -481,7 +481,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
page = vm_normal_page(vma, addr, pte);
if (page) {
get_page(page);
- page_dup_rmap(page);
+ page_dup_rmap(page, dst_mm);
rss[!!PageAnon(page)]++;
}

@@ -681,7 +681,7 @@ static unsigned long zap_pte_range(struc
mark_page_accessed(page);
file_rss--;
}
- page_remove_rmap(page);
+ page_remove_rmap(page, mm);
tlb_remove_page(tlb, page);
continue;
}
@@ -1575,7 +1575,7 @@ gotten:
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
- page_remove_rmap(old_page);
+ page_remove_rmap(old_page, mm);
if (!PageAnon(old_page)) {
dec_mm_counter(mm, file_rss);
inc_mm_counter(mm, anon_rss);
diff -puN mm/rmap.c~container-memctlr-task-migration mm/rmap.c
--- linux-2.6.19-rc2/mm/rmap.c~container-memctlr-task-migration 2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/rmap.c 2006-11-09 21:56:49.000000000 +0530
@@ -576,7 +576,7 @@ void page_add_file_rmap(struct page *pag
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, struct mm_struct *mm)
{
if (atomic_add_negative(-1, &page->_mapcount)) {
if (unlikely(page_mapcount(page) < 0)) {
@@ -689,7 +689,7 @@ static int try_to_unmap_one(struct page
dec_mm_counter(mm, file_rss);


- page_remove_rmap(page);
+ page_remove_rmap(page, mm);
page_cache_release(page);

out_unmap:
@@ -779,7 +779,7 @@ static void try_to_unmap_cluster(unsigne
if (pte_dirty(pteval))
set_page_dirty(page);

- page_remove_rmap(page);
+ page_remove_rmap(page, mm);
page_cache_release(page);
dec_mm_counter(mm, file_rss);
(*mapcount)--;
diff -puN include/linux/rmap.h~container-memctlr-task-migration include/linux/rmap.h
--- linux-2.6.19-rc2/include/linux/rmap.h~container-memctlr-task-migration 2006-11-09 21:56:49.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/rmap.h 2006-11-09 21:56:49.000000000 +0530
@@ -73,7 +73,7 @@ void __anon_vma_link(struct vm_area_stru
void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_remove_rmap(struct page *, struct mm_struct *);

/**
* page_dup_rmap - duplicate pte mapping to a page
@@ -82,10 +82,10 @@ void page_remove_rmap(struct page *);
* For copy_page_range only: minimal extract from page_add_rmap,
* avoiding unnecessary tests (already checked) so it's quicker.
*/
-static inline void page_dup_rmap(struct page *page)
+static inline void page_dup_rmap(struct page *page, struct mm_struct *mm)
{
atomic_inc(&page->_mapcount);
- memctlr_inc_rss(page);
+ memctlr_inc_rss_mm(page, mm);
}

/*
_

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-09 19:36:34

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 3/8] RSS controller add callbacks



Add callbacks to allocate and free instances of the controller as the
hierarchy of resource groups is modified.

Signed-off-by: Balbir Singh <[email protected]>
---

kernel/res_group/memctlr.c | 58 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 55 insertions(+), 3 deletions(-)

diff -puN kernel/res_group/memctlr.c~container-memctlr-callbacks kernel/res_group/memctlr.c
--- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-callbacks 2006-11-09 21:42:35.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c 2006-11-09 21:42:35.000000000 +0530
@@ -34,6 +34,8 @@

static const char res_ctlr_name[] = "memctlr";
static struct resource_group *root_rgroup;
+static const char version[] = "0.01";
+static struct memctlr *memctlr_root;

struct mem_counter {
atomic_long_t rss;
@@ -64,14 +66,64 @@ static struct memctlr *get_memctlr(struc
&memctlr_rg));
}

+static void memctlr_init_new(struct memctlr *res)
+{
+ res->shares.min_shares = SHARE_DONT_CARE;
+ res->shares.max_shares = SHARE_DONT_CARE;
+ res->shares.child_shares_divisor = SHARE_DEFAULT_DIVISOR;
+ res->shares.unused_min_shares = SHARE_DEFAULT_DIVISOR;
+}
+
+static struct res_shares *memctlr_alloc_instance(struct resource_group *rgroup)
+{
+ struct memctlr *res;
+
+ res = kzalloc(sizeof(struct memctlr), GFP_KERNEL);
+ if (!res)
+ return NULL;
+ res->rgroup = rgroup;
+ memctlr_init_new(res);
+ if (is_res_group_root(rgroup)) {
+ root_rgroup = rgroup;
+ memctlr_root = res;
+ printk("Memory Controller version %s\n", version);
+ }
+ return &res->shares;
+}
+
+static void memctlr_free_instance(struct res_shares *shares)
+{
+ struct memctlr *res, *parres;
+
+ res = get_memctlr_from_shares(shares);
+ BUG_ON(!res);
+ /*
+ * Containers do not allow removal of groups that have tasks
+ * associated with them. To free a container, it must be empty.
+ * Handle transfer of charges in the move_task notification
+ */
+ kfree(res);
+}
+
+static ssize_t memctlr_show_stats(struct res_shares *shares, char *buf,
+ size_t len)
+{
+ int i = 0;
+
+ i += snprintf(buf, len, "Accounting will be added soon\n");
+ buf += i;
+ len -= i;
+ return i;
+}
+
struct res_controller memctlr_rg = {
.name = res_ctlr_name,
.ctlr_id = NO_RES_ID,
- .alloc_shares_struct = NULL,
- .free_shares_struct = NULL,
+ .alloc_shares_struct = memctlr_alloc_instance,
+ .free_shares_struct = memctlr_free_instance,
.move_task = NULL,
.shares_changed = NULL,
- .show_stats = NULL,
+ .show_stats = memctlr_show_stats,
};

int __init memctlr_init(void)
_

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-09 19:35:58

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 1/8] Fix resource groups parsing, while assigning shares



echo adds a "\n" to the end of a string. When this string is copied from
user space, we need to remove it, so that match_token() can parse
the user space string correctly

Signed-off-by: Balbir Singh <[email protected]>
---

kernel/res_group/rgcs.c | 6 ++++++
1 file changed, 6 insertions(+)

diff -puN kernel/res_group/rgcs.c~container-res-groups-fix-parsing kernel/res_group/rgcs.c
--- linux-2.6.19-rc2/kernel/res_group/rgcs.c~container-res-groups-fix-parsing 2006-11-09 23:08:10.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/rgcs.c 2006-11-09 23:08:10.000000000 +0530
@@ -241,6 +241,12 @@ ssize_t res_group_file_write(struct cont
}
buf[nbytes] = 0; /* nul-terminate */

+ /*
+ * Ignore "\n". It might come in from echo(1)
+ */
+ if (buf[nbytes - 1] == '\n')
+ buf[nbytes - 1] = 0;
+
container_manage_lock();

if (container_is_removed(cont)) {
_

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-09 19:37:59

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 8/8] RSS controller support reclamation



Reclaim memory as we hit the max_shares limit. The code for reclamation
is inspired from Dave Hansen's challenged memory controller and from the
shrink_all_memory() code

Reclamation can be triggered from two paths

1. While incrementing the RSS, we hit the limit of the container
2. A container is resized, such that it's new limit is below its current
RSS

In (1) reclamation takes place in the background.

TODO's

1. max_shares currently works like a soft limit. The RSS can grow beyond it's
limit. One possible fix is to introduce a soft limit (reclaim when the
container hits the soft limit) and fail when we hit the hard limit

Signed-off-by: Balbir Singh <[email protected]>
---

include/linux/memctlr.h | 17 ++++++
kernel/fork.c | 1
kernel/res_group/memctlr.c | 116 ++++++++++++++++++++++++++++++++++++++-------
mm/rmap.c | 72 +++++++++++++++++++++++++++
mm/vmscan.c | 72 +++++++++++++++++++++++++++
5 files changed, 260 insertions(+), 18 deletions(-)

diff -puN mm/vmscan.c~container-memctlr-reclaim mm/vmscan.c
--- linux-2.6.19-rc2/mm/vmscan.c~container-memctlr-reclaim 2006-11-09 22:21:11.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/vmscan.c 2006-11-09 22:21:11.000000000 +0530
@@ -36,6 +36,8 @@
#include <linux/rwsem.h>
#include <linux/delay.h>
#include <linux/kthread.h>
+#include <linux/container.h>
+#include <linux/memctlr.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -65,6 +67,9 @@ struct scan_control {
int swappiness;

int all_unreclaimable;
+
+ int overlimit;
+ void *container; /* Added as void * to avoid #ifdef's */
};

/*
@@ -811,6 +816,10 @@ force_reclaim_mapped:
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
+ if (!memctlr_page_reclaim(page, sc->container, sc->overlimit)) {
+ list_add(&page->lru, &l_active);
+ continue;
+ }
if (page_mapped(page)) {
if (!reclaim_mapped ||
(total_swap_pages == 0 && PageAnon(page)) ||
@@ -1008,6 +1017,8 @@ unsigned long try_to_free_pages(struct z
.swap_cluster_max = SWAP_CLUSTER_MAX,
.may_swap = 1,
.swappiness = vm_swappiness,
+ .overlimit = SC_OVERLIMIT_NONE,
+ .container = NULL,
};

count_vm_event(ALLOCSTALL);
@@ -1104,6 +1115,8 @@ static unsigned long balance_pgdat(pg_da
.may_swap = 1,
.swap_cluster_max = SWAP_CLUSTER_MAX,
.swappiness = vm_swappiness,
+ .overlimit = SC_OVERLIMIT_NONE,
+ .container = NULL,
};

loop_again:
@@ -1324,7 +1337,7 @@ void wakeup_kswapd(struct zone *zone, in
wake_up_interruptible(&pgdat->kswapd_wait);
}

-#ifdef CONFIG_PM
+#if defined(CONFIG_PM) || defined(CONFIG_RES_GROUPS_MEMORY)
/*
* Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages
* from LRU lists system-wide, for given pass and priority, and returns the
@@ -1368,7 +1381,60 @@ static unsigned long shrink_all_zones(un

return ret;
}
+#endif

+#ifdef CONFIG_RES_GROUPS_MEMORY
+/*
+ * Modelled after shrink_all_memory
+ */
+unsigned long memctlr_shrink_container_memory(unsigned long nr_pages,
+ struct container *container,
+ int overlimit)
+{
+ unsigned long lru_pages;
+ unsigned long ret = 0;
+ int pass;
+ struct zone *zone;
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .may_swap = 0,
+ .swap_cluster_max = nr_pages,
+ .may_writepage = 1,
+ .swappiness = vm_swappiness,
+ .overlimit = overlimit,
+ .container = container,
+ };
+
+ lru_pages = 0;
+ for_each_zone(zone)
+ lru_pages += zone->nr_active + zone->nr_inactive;
+
+ for (pass = 0; pass < 5; pass++) {
+ int prio;
+
+ /* Force reclaiming mapped pages in the passes #3 and #4 */
+ if (pass > 2) {
+ sc.may_swap = 1;
+ sc.swappiness = 100;
+ }
+
+ for (prio = DEF_PRIORITY; prio >= 0; prio--) {
+ unsigned long nr_to_scan = nr_pages - ret;
+
+ sc.nr_scanned = 0;
+ ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
+ if (ret >= nr_pages)
+ break;
+
+ if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
+ blk_congestion_wait(WRITE, HZ / 10);
+ }
+ }
+ return ret;
+}
+#endif
+
+#ifdef CONFIG_PM
/*
* Try to free `nr_pages' of memory, system-wide, and return the number of
* freed pages.
@@ -1390,6 +1456,8 @@ unsigned long shrink_all_memory(unsigned
.swap_cluster_max = nr_pages,
.may_writepage = 1,
.swappiness = vm_swappiness,
+ .overlimit = SC_OVERLIMIT_NONE,
+ .container = NULL,
};

current->reclaim_state = &reclaim_state;
@@ -1585,6 +1653,8 @@ static int __zone_reclaim(struct zone *z
SWAP_CLUSTER_MAX),
.gfp_mask = gfp_mask,
.swappiness = vm_swappiness,
+ .overlimit = SC_OVERLIMIT_NONE,
+ .container = NULL,
};
unsigned long slab_reclaimable;

diff -puN kernel/res_group/memctlr.c~container-memctlr-reclaim kernel/res_group/memctlr.c
--- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-reclaim 2006-11-09 22:21:11.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c 2006-11-09 22:21:11.000000000 +0530
@@ -33,6 +33,7 @@
#include <linux/memctlr.h>
#include <linux/mm.h>
#include <linux/swap.h>
+#include <linux/workqueue.h>
#include <asm/pgtable.h>

static const char res_ctlr_name[] = "memctlr";
@@ -40,7 +41,10 @@ static struct resource_group *root_rgrou
static const char version[] = "0.05";
static struct memctlr *memctlr_root;

-#define MEMCTLR_MAGIC 0xdededede
+static void memctlr_callback(void *data);
+static atomic_long_t failed_inc_rss;
+static atomic_long_t failed_dec_rss;
+

struct mem_counter {
atomic_long_t rss;
@@ -57,9 +61,12 @@ struct memctlr {
int magic;
spinlock_t lock;
long nr_pages;
+ int reclaim_in_progress;
};

struct res_controller memctlr_rg;
+static DECLARE_WORK(memctlr_work, memctlr_callback, NULL);
+#define MEMCTLR_MAGIC 0xdededede

static struct memctlr *get_memctlr_from_shares(struct res_shares *shares)
{
@@ -96,7 +103,7 @@ void mm_free_mem_counter(struct mm_struc
void mm_assign_container(struct mm_struct *mm, struct task_struct *p)
{
rcu_read_lock();
- mm->container = rcu_dereference(p->container);
+ rcu_assign_pointer(mm->container, rcu_dereference(p->container));
rcu_read_unlock();
}

@@ -123,38 +130,64 @@ static inline struct memctlr *get_task_m
return res;
}

-
-void memctlr_inc_rss_mm(struct page *page, struct mm_struct *mm)
+static void memctlr_callback(void *data)
{
- struct memctlr *res;
+ struct memctlr *res = (struct memctlr *)data;
+ long rss;
+ unsigned long nr_shrink = 0;

- res = get_task_memctlr(current);
- if (!res) {
- printk(KERN_INFO "inc_rss no res set *---*\n");
- return;
- }
+ BUG_ON(!res);

spin_lock(&res->lock);
- atomic_long_inc(&mm->counter->rss);
- atomic_long_inc(&res->counter.rss);
+ rss = atomic_long_read(&res->counter.rss);
+ if ((rss > res->nr_pages) && (res->nr_pages > 0))
+ nr_shrink = rss - ((res->nr_pages * 4) / 5);
+ spin_unlock(&res->lock);
+
+ if (nr_shrink)
+ memctlr_shrink_container_memory(nr_shrink, res->rgroup,
+ SC_OVERLIMIT_ONE);
+ spin_lock(&res->lock);
+ res->reclaim_in_progress = 0;
spin_unlock(&res->lock);
}

-void memctlr_inc_rss(struct page *page)
+void memctlr_inc_rss_mm(struct page *page, struct mm_struct *mm)
{
struct memctlr *res;
- struct mm_struct *mm = get_task_mm(current);
+ long rss;

res = get_task_memctlr(current);
if (!res) {
- printk(KERN_INFO "inc_rss no res set *---*\n");
+ atomic_long_inc(&failed_inc_rss);
return;
}

spin_lock(&res->lock);
atomic_long_inc(&mm->counter->rss);
atomic_long_inc(&res->counter.rss);
+ rss = atomic_long_read(&res->counter.rss);
+ if ((res->nr_pages < rss) && (res->nr_pages > 0)) {
+ /*
+ * Reclaim if we exceed our limit
+ * Schedule a job to do so
+ */
+ if (res->reclaim_in_progress)
+ goto done;
+ res->reclaim_in_progress = 1;
+ spin_unlock(&res->lock);
+ PREPARE_WORK(&memctlr_work, memctlr_callback, res);
+ schedule_work(&memctlr_work);
+ return;
+ }
+done:
spin_unlock(&res->lock);
+}
+
+void memctlr_inc_rss(struct page *page)
+{
+ struct mm_struct *mm = get_task_mm(current);
+ memctlr_inc_rss_mm(page, mm);
mmput(mm);
}

@@ -162,9 +195,9 @@ void memctlr_dec_rss(struct page *page,
{
struct memctlr *res;

- res = get_task_memctlr(current);
+ res = get_memctlr(mm->container);
if (!res) {
- printk(KERN_INFO "dec_rss no res set *---*\n");
+ atomic_long_inc(&failed_dec_rss);
return;
}

@@ -183,6 +216,7 @@ static void memctlr_init_new(struct memc

memctlr_init_mem_counter(&res->counter);
res->nr_pages = SHARE_DONT_CARE;
+ res->reclaim_in_progress = 0;
spin_lock_init(&res->lock);
}

@@ -200,6 +234,7 @@ static struct res_shares *memctlr_alloc_
root_rgroup = rgroup;
memctlr_root = res;
res->nr_pages = nr_free_pages();
+ res->shares.max_shares = SHARE_DEFAULT_DIVISOR;
printk("Memory Controller version %s\n", version);
}
return &res->shares;
@@ -355,6 +390,20 @@ static ssize_t memctlr_show_stats(struct
buf += i;
len -= i;
j += i;
+
+ i = snprintf(buf, len, "Failed INC RSS Pages %ld\n",
+ atomic_long_read(&failed_inc_rss));
+
+ buf += i;
+ len -= i;
+ j += i;
+
+ i = snprintf(buf, len, "Failed DEC RSS Pages %ld\n",
+ atomic_long_read(&failed_dec_rss));
+
+ buf += i;
+ len -= i;
+ j += i;
return j;
}

@@ -421,6 +470,8 @@ static void recalc_and_propagate(struct
int child_divisor;
u64 numerator;
struct memctlr *child_res;
+ long rss;
+ unsigned long nr_shrink = 0;

if (parres) {
if (res->shares.max_shares == SHARE_DONT_CARE ||
@@ -445,6 +496,35 @@ static void recalc_and_propagate(struct
recalc_and_propagate(child_res, res);
}

+ /*
+ * Reclaim if our limit was shrunk
+ */
+ spin_lock(&res->lock);
+ rss = atomic_long_read(&res->counter.rss);
+ if ((rss > res->nr_pages) && (res->nr_pages > 0))
+ nr_shrink = rss - ((res->nr_pages * 4) / 5);
+ spin_unlock(&res->lock);
+
+ if (nr_shrink)
+ memctlr_shrink_container_memory(nr_shrink, NULL,
+ SC_OVERLIMIT_ALL);
+}
+
+int memctlr_over_limit(struct container *container)
+{
+ struct resource_group *rgroup = container;
+ struct memctlr *res;
+ int ret = 0;
+
+ res = get_memctlr(rgroup);
+ if (!res)
+ return ret;
+
+ spin_lock(&res->lock);
+ if (atomic_long_read(&res->counter.rss) > res->nr_pages)
+ ret = 1;
+ spin_unlock(&res->lock);
+ return ret;
}

static void memctlr_shares_changed(struct res_shares *shares)
@@ -477,6 +557,8 @@ int __init memctlr_init(void)
{
if (memctlr_rg.ctlr_id != NO_RES_ID)
return -EBUSY; /* already registered */
+ atomic_long_set(&failed_inc_rss, 0);
+ atomic_long_set(&failed_dec_rss, 0);
return register_controller(&memctlr_rg);
}

diff -puN include/linux/memctlr.h~container-memctlr-reclaim include/linux/memctlr.h
--- linux-2.6.19-rc2/include/linux/memctlr.h~container-memctlr-reclaim 2006-11-09 22:21:11.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/memctlr.h 2006-11-09 22:21:11.000000000 +0530
@@ -34,6 +34,12 @@ extern void memctlr_inc_rss_mm(struct pa
extern void memctlr_dec_rss(struct page *page, struct mm_struct *mm);
extern void mm_free_mem_counter(struct mm_struct *mm);
extern int proc_memacct(struct task_struct *task, char *buffer);
+extern unsigned long memctlr_shrink_container_memory(unsigned long nr_pages,
+ struct container *container,
+ int overlimit);
+extern int memctlr_page_reclaim(struct page *page, void *container,
+ int overlimit);
+extern int memctlr_over_limit(struct container *container);

#else /* CONFIG_RES_GROUPS_MEMORY */

@@ -54,9 +60,20 @@ int mm_init_mem_counter(struct mm_struct
void mm_assign_container(struct mm_struct *mm, struct task_struct *p)
{}

+int memctlr_page_reclaim(struct page *page, void *container, int overlimit)
+{
+ return 1;
+}
+
void mm_free_mem_counter(struct mm_struct *mm)
{}

#endif /* CONFIG_RES_GROUPS_MEMORY */

+enum {
+ SC_OVERLIMIT_NONE, /* The scan is container independent */
+ SC_OVERLIMIT_ONE, /* Scan the one container specified */
+ SC_OVERLIMIT_ALL, /* Scan all containers */
+};
+
#endif /* _LINUX_MEMCTRL_H */
diff -puN mm/rmap.c~container-memctlr-reclaim mm/rmap.c
--- linux-2.6.19-rc2/mm/rmap.c~container-memctlr-reclaim 2006-11-09 22:21:11.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/rmap.c 2006-11-09 22:21:11.000000000 +0530
@@ -604,6 +604,78 @@ void page_remove_rmap(struct page *page,
memctlr_dec_rss(page, mm);
}

+#ifdef CONFIG_RES_GROUPS_MEMORY
+/*
+ * Can we push this code down to try_to_unmap()?
+ */
+int memctlr_page_reclaim(struct page *page, void *container, int overlimit)
+{
+ int ret = 0;
+
+ if (overlimit == SC_OVERLIMIT_NONE)
+ return 1;
+ if (container == NULL && overlimit != SC_OVERLIMIT_ALL)
+ return 1;
+
+ if (!page_mapped(page))
+ return 0;
+
+ if (PageAnon(page)) {
+ struct anon_vma *anon_vma;
+ struct vm_area_struct *vma;
+
+ anon_vma = page_lock_anon_vma(page);
+ if (!anon_vma)
+ return 0;
+
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ if (memctlr_over_limit(vma->vm_mm->container) &&
+ ((container == vma->vm_mm->container) ||
+ (overlimit == SC_OVERLIMIT_ALL))) {
+ ret = 1;
+ break;
+ }
+ }
+ spin_unlock(&anon_vma->lock);
+ } else {
+ struct address_space *mapping = page_mapping(page);
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+
+ if (!mapping)
+ return 0;
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+ pgoff) {
+ if (memctlr_over_limit(vma->vm_mm->container) &&
+ ((container == vma->vm_mm->container) ||
+ (overlimit == SC_OVERLIMIT_ALL))) {
+ ret = 1;
+ break;
+ }
+ }
+ if (ret)
+ goto done;
+
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
+ shared.vm_set.list) {
+ if (memctlr_over_limit(vma->vm_mm->container) &&
+ ((container == vma->vm_mm->container) ||
+ (overlimit == SC_OVERLIMIT_ALL))) {
+ ret = 1;
+ break;
+ }
+ }
+done:
+ spin_unlock(&mapping->i_mmap_lock);
+ }
+
+ return ret;
+}
+#endif
+
/*
* Subfunctions of try_to_unmap: try_to_unmap_one called
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
diff -puN kernel/fork.c~container-memctlr-reclaim kernel/fork.c
--- linux-2.6.19-rc2/kernel/fork.c~container-memctlr-reclaim 2006-11-09 22:21:11.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/fork.c 2006-11-09 22:21:11.000000000 +0530
@@ -364,6 +364,7 @@ struct mm_struct * mm_alloc(void)
if (mm) {
memset(mm, 0, sizeof(*mm));
mm = mm_init(mm);
+ mm_assign_container(mm, current);
}
return mm;
}
_

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-09 19:37:14

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 4/8] RSS controller accounting



Account RSS usage of a task and the associated container. The definition
of RSS was debated and discussed in the following thread

http://lkml.org/lkml/2006/10/10/130


The code tracks all resident pages (including shared pages) as RSS. This patch
can easily adapt to the definition of RSS that will be agreed upon. This
implementation provides a proof of concept RSS controller.

The accounting is inspired from Rohit Seth's container patches.

TODO's

1. Merge file_rss and anon_rss tracking with the current rss tracking to
maximize code reuse
2. Add/remove RSS tracking as the definition of RSS evolves


Signed-off-by: Balbir Singh <[email protected]>
---

include/linux/memctlr.h | 26 +++++++++++
include/linux/rmap.h | 2
include/linux/sched.h | 11 +++++
kernel/fork.c | 6 ++
kernel/res_group/memctlr.c | 99 +++++++++++++++++++++++++++++++++++++++++++--
mm/rmap.c | 6 ++
6 files changed, 145 insertions(+), 5 deletions(-)

diff -puN include/linux/sched.h~container-memctlr-acct include/linux/sched.h
--- linux-2.6.19-rc2/include/linux/sched.h~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/sched.h 2006-11-09 21:46:22.000000000 +0530
@@ -88,6 +88,10 @@ struct sched_param {
struct exec_domain;
struct futex_pi_state;

+struct memctlr;
+struct container;
+struct mem_counter;
+
/*
* List of flags we want to share for kernel threads,
* if only because they are not used by them anyway.
@@ -355,6 +359,13 @@ struct mm_struct {
/* aio bits */
rwlock_t ioctx_list_lock;
struct kioctx *ioctx_list;
+#ifdef CONFIG_RES_GROUPS_MEMORY
+ struct container *container;
+ /*
+ * Try and merge anon and file rss accounting
+ */
+ struct mem_counter *counter;
+#endif
};

struct sighand_struct {
diff -puN kernel/res_group/memctlr.c~container-memctlr-acct kernel/res_group/memctlr.c
--- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c 2006-11-09 21:47:06.000000000 +0530
@@ -37,6 +37,8 @@ static struct resource_group *root_rgrou
static const char version[] = "0.01";
static struct memctlr *memctlr_root;

+#define MEMCTLR_MAGIC 0xdededede
+
struct mem_counter {
atomic_long_t rss;
};
@@ -49,6 +51,7 @@ struct memctlr {
/* Statistics */
int successes;
int failures;
+ int magic;
};

struct res_controller memctlr_rg;
@@ -66,12 +69,91 @@ static struct memctlr *get_memctlr(struc
&memctlr_rg));
}

+static void memctlr_init_mem_counter(struct mem_counter *counter)
+{
+ atomic_long_set(&counter->rss, 0);
+}
+
+int mm_init_mem_counter(struct mm_struct *mm)
+{
+ mm->counter = kmalloc(sizeof(struct mem_counter), GFP_KERNEL);
+ if (!mm->counter)
+ return -ENOMEM;
+ memctlr_init_mem_counter(mm->counter);
+ return 0;
+}
+
+void mm_free_mem_counter(struct mm_struct *mm)
+{
+ kfree(mm->counter);
+}
+
+void mm_assign_container(struct mm_struct *mm, struct task_struct *p)
+{
+ rcu_read_lock();
+ mm->container = rcu_dereference(p->container);
+ rcu_read_unlock();
+}
+
+static inline struct memctlr *get_memctlr_from_page(struct page *page)
+{
+ struct resource_group *rgroup;
+ struct memctlr *res;
+
+ /*
+ * Is the resource groups infrastructure initialized?
+ */
+ if (!memctlr_root)
+ return NULL;
+
+ rcu_read_lock();
+ rgroup = (struct resource_group *)rcu_dereference(current->container);
+ rcu_read_unlock();
+
+ res = get_memctlr(rgroup);
+ if (!res)
+ return NULL;
+
+ BUG_ON(res->magic != MEMCTLR_MAGIC);
+ return res;
+}
+
+
+void memctlr_inc_rss(struct page *page)
+{
+ struct memctlr *res;
+
+ res = get_memctlr_from_page(page);
+ if (!res)
+ return;
+
+ atomic_long_inc(&current->mm->counter->rss);
+ atomic_long_inc(&res->counter.rss);
+}
+
+void memctlr_dec_rss(struct page *page)
+{
+ struct memctlr *res;
+
+ res = get_memctlr_from_page(page);
+ if (!res)
+ return;
+
+ atomic_long_dec(&res->counter.rss);
+
+ if ((current->flags & PF_EXITING) && !current->mm)
+ return;
+ atomic_long_dec(&current->mm->counter->rss);
+}
+
static void memctlr_init_new(struct memctlr *res)
{
res->shares.min_shares = SHARE_DONT_CARE;
res->shares.max_shares = SHARE_DONT_CARE;
res->shares.child_shares_divisor = SHARE_DEFAULT_DIVISOR;
res->shares.unused_min_shares = SHARE_DEFAULT_DIVISOR;
+
+ memctlr_init_mem_counter(&res->counter);
}

static struct res_shares *memctlr_alloc_instance(struct resource_group *rgroup)
@@ -83,6 +165,7 @@ static struct res_shares *memctlr_alloc_
return NULL;
res->rgroup = rgroup;
memctlr_init_new(res);
+ res->magic = MEMCTLR_MAGIC;
if (is_res_group_root(rgroup)) {
root_rgroup = rgroup;
memctlr_root = res;
@@ -93,7 +176,7 @@ static struct res_shares *memctlr_alloc_

static void memctlr_free_instance(struct res_shares *shares)
{
- struct memctlr *res, *parres;
+ struct memctlr *res;

res = get_memctlr_from_shares(shares);
BUG_ON(!res);
@@ -108,12 +191,19 @@ static void memctlr_free_instance(struct
static ssize_t memctlr_show_stats(struct res_shares *shares, char *buf,
size_t len)
{
- int i = 0;
+ int i = 0, j = 0;
+ struct memctlr *res;
+
+ res = get_memctlr_from_shares(shares);
+ BUG_ON(!res);

- i += snprintf(buf, len, "Accounting will be added soon\n");
+ i = snprintf(buf, len, "RSS Pages %ld\n",
+ atomic_long_read(&res->counter.rss));
buf += i;
len -= i;
- return i;
+ j += i;
+
+ return j;
}

struct res_controller memctlr_rg = {
@@ -142,5 +232,6 @@ void __exit memctlr_exit(void)
BUG_ON(rc != 0);
}

+
module_init(memctlr_init);
module_exit(memctlr_exit);
diff -puN include/linux/memctlr.h~container-memctlr-acct include/linux/memctlr.h
--- linux-2.6.19-rc2/include/linux/memctlr.h~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/memctlr.h 2006-11-09 21:46:22.000000000 +0530
@@ -26,6 +26,32 @@

#ifdef CONFIG_RES_GROUPS_MEMORY
#include <linux/res_group_rc.h>
+
+extern int mm_init_mem_counter(struct mm_struct *mm);
+extern void mm_assign_container(struct mm_struct *mm, struct task_struct *p);
+extern void memctlr_inc_rss(struct page *page);
+extern void memctlr_dec_rss(struct page *page);
+extern void mm_free_mem_counter(struct mm_struct *mm);
+
+#else /* CONFIG_RES_GROUPS_MEMORY */
+
+void memctlr_inc_rss(struct page *page)
+{}
+
+void memctlr_dec_rss(struct page *page)
+{}
+
+int mm_init_mem_counter(struct mm_struct *mm)
+{
+ return 0;
+}
+
+void mm_assign_container(struct mm_struct *mm, struct task_struct *p)
+{}
+
+void mm_free_mem_counter(struct mm_struct *mm)
+{}
+
#endif /* CONFIG_RES_GROUPS_MEMORY */

#endif /* _LINUX_MEMCTRL_H */
diff -puN kernel/fork.c~container-memctlr-acct kernel/fork.c
--- linux-2.6.19-rc2/kernel/fork.c~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/fork.c 2006-11-09 21:46:22.000000000 +0530
@@ -49,6 +49,7 @@
#include <linux/taskstats_kern.h>
#include <linux/random.h>
#include <linux/numtasks.h>
+#include <linux/memctlr.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -340,11 +341,14 @@ static struct mm_struct * mm_init(struct
mm->ioctx_list = NULL;
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
+ if (mm_init_mem_counter(mm) < 0)
+ goto mem_fail;

if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
return mm;
}
+mem_fail:
free_mm(mm);
return NULL;
}
@@ -372,6 +376,7 @@ struct mm_struct * mm_alloc(void)
void fastcall __mmdrop(struct mm_struct *mm)
{
BUG_ON(mm == &init_mm);
+ mm_free_mem_counter(mm);
mm_free_pgd(mm);
destroy_context(mm);
free_mm(mm);
@@ -544,6 +549,7 @@ static int copy_mm(unsigned long clone_f

good_mm:
tsk->mm = mm;
+ mm_assign_container(mm, tsk);
tsk->active_mm = mm;
return 0;

diff -puN mm/rmap.c~container-memctlr-acct mm/rmap.c
--- linux-2.6.19-rc2/mm/rmap.c~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/mm/rmap.c 2006-11-09 21:46:22.000000000 +0530
@@ -537,6 +537,7 @@ void page_add_anon_rmap(struct page *pag
if (atomic_inc_and_test(&page->_mapcount))
__page_set_anon_rmap(page, vma, address);
/* else checking page index and mapping is racy */
+ memctlr_inc_rss(page);
}

/*
@@ -553,6 +554,7 @@ void page_add_new_anon_rmap(struct page
{
atomic_set(&page->_mapcount, 0); /* elevate count by 1 (starts at -1) */
__page_set_anon_rmap(page, vma, address);
+ memctlr_inc_rss(page);
}

/**
@@ -565,6 +567,7 @@ void page_add_file_rmap(struct page *pag
{
if (atomic_inc_and_test(&page->_mapcount))
__inc_zone_page_state(page, NR_FILE_MAPPED);
+ memctlr_inc_rss(page);
}

/**
@@ -596,8 +599,9 @@ void page_remove_rmap(struct page *page)
if (page_test_and_clear_dirty(page))
set_page_dirty(page);
__dec_zone_page_state(page,
- PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
+ PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
}
+ memctlr_dec_rss(page, mm);
}

/*
diff -puN include/linux/rmap.h~container-memctlr-acct include/linux/rmap.h
--- linux-2.6.19-rc2/include/linux/rmap.h~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
+++ linux-2.6.19-rc2-balbir/include/linux/rmap.h 2006-11-09 21:46:22.000000000 +0530
@@ -8,6 +8,7 @@
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/spinlock.h>
+#include <linux/memctlr.h>

/*
* The anon_vma heads a list of private "related" vmas, to scan if
@@ -84,6 +85,7 @@ void page_remove_rmap(struct page *);
static inline void page_dup_rmap(struct page *page)
{
atomic_inc(&page->_mapcount);
+ memctlr_inc_rss(page);
}

/*
_

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-09 19:36:48

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 6/8] RSS controller shares allocation



Support shares assignment and propagation.

Signed-off-by: Balbir Singh <[email protected]>
---

kernel/res_group/memctlr.c | 59 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 58 insertions(+), 1 deletion(-)

diff -puN kernel/res_group/memctlr.c~container-memctlr-shares kernel/res_group/memctlr.c
--- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-shares 2006-11-09 22:20:28.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c 2006-11-09 22:20:28.000000000 +0530
@@ -32,6 +32,7 @@
#include <linux/res_group_rc.h>
#include <linux/memctlr.h>
#include <linux/mm.h>
+#include <linux/swap.h>
#include <asm/pgtable.h>

static const char res_ctlr_name[] = "memctlr";
@@ -55,6 +56,7 @@ struct memctlr {
int failures;
int magic;
spinlock_t lock;
+ long nr_pages;
};

struct res_controller memctlr_rg;
@@ -180,6 +182,7 @@ static void memctlr_init_new(struct memc
res->shares.unused_min_shares = SHARE_DEFAULT_DIVISOR;

memctlr_init_mem_counter(&res->counter);
+ res->nr_pages = SHARE_DONT_CARE;
spin_lock_init(&res->lock);
}

@@ -196,6 +199,7 @@ static struct res_shares *memctlr_alloc_
if (is_res_group_root(rgroup)) {
root_rgroup = rgroup;
memctlr_root = res;
+ res->nr_pages = nr_free_pages();
printk("Memory Controller version %s\n", version);
}
return &res->shares;
@@ -346,6 +350,11 @@ static ssize_t memctlr_show_stats(struct
len -= i;
j += i;

+ i = snprintf(buf, len, "Max Allowed Pages %ld\n", res->nr_pages);
+
+ buf += i;
+ len -= i;
+ j += i;
return j;
}

@@ -406,13 +415,61 @@ static void memctlr_move_task(struct tas
double_res_unlock(oldres, newres);
}

+static void recalc_and_propagate(struct memctlr *res, struct memctlr *parres)
+{
+ struct resource_group *child = NULL;
+ int child_divisor;
+ u64 numerator;
+ struct memctlr *child_res;
+
+ if (parres) {
+ if (res->shares.max_shares == SHARE_DONT_CARE ||
+ parres->shares.max_shares == SHARE_DONT_CARE)
+ return;
+
+ child_divisor = parres->shares.child_shares_divisor;
+ if (child_divisor == 0)
+ return;
+
+ numerator = (u64)(parres->shares.unused_min_shares *
+ res->shares.max_shares);
+ do_div(numerator, child_divisor);
+ numerator = (u64)(parres->nr_pages * numerator);
+ do_div(numerator, SHARE_DEFAULT_DIVISOR);
+ res->nr_pages = numerator;
+ }
+
+ for_each_child(child, res->rgroup) {
+ child_res = get_memctlr(child);
+ BUG_ON(!child_res);
+ recalc_and_propagate(child_res, res);
+ }
+
+}
+
+static void memctlr_shares_changed(struct res_shares *shares)
+{
+ struct memctlr *res, *parres;
+
+ res = get_memctlr_from_shares(shares);
+ if (!res)
+ return;
+
+ if (is_res_group_root(res->rgroup))
+ parres = NULL;
+ else
+ parres = get_memctlr((struct container *)res->rgroup->parent);
+
+ recalc_and_propagate(res, parres);
+}
+
struct res_controller memctlr_rg = {
.name = res_ctlr_name,
.ctlr_id = NO_RES_ID,
.alloc_shares_struct = memctlr_alloc_instance,
.free_shares_struct = memctlr_free_instance,
.move_task = memctlr_move_task,
- .shares_changed = NULL,
+ .shares_changed = memctlr_shares_changed,
.show_stats = memctlr_show_stats,
};

_

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-09 19:37:15

by Balbir Singh

[permalink] [raw]
Subject: [RFC][PATCH 7/8] RSS controller fix resource groups parsing



echo adds a "\n" to the end of a string. When this string is copied from
user space, we need to remove it, so that match_token() can parse
the user space string correctly

Signed-off-by: Balbir Singh <[email protected]>
---

kernel/res_group/rgcs.c | 6 ++++++
1 file changed, 6 insertions(+)

diff -puN kernel/res_group/rgcs.c~container-res-groups-fix-parsing kernel/res_group/rgcs.c
--- linux-2.6.19-rc2/kernel/res_group/rgcs.c~container-res-groups-fix-parsing 2006-11-09 23:08:10.000000000 +0530
+++ linux-2.6.19-rc2-balbir/kernel/res_group/rgcs.c 2006-11-09 23:08:10.000000000 +0530
@@ -241,6 +241,12 @@ ssize_t res_group_file_write(struct cont
}
buf[nbytes] = 0; /* nul-terminate */

+ /*
+ * Ignore "\n". It might come in from echo(1)
+ */
+ if (buf[nbytes - 1] == '\n')
+ buf[nbytes - 1] = 0;
+
container_manage_lock();

if (container_is_removed(cont)) {
_

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-09 19:45:58

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/8] RSS controller support reclamation

On Fri, 2006-11-10 at 01:06 +0530, Balbir Singh wrote:
>
> Reclaim memory as we hit the max_shares limit. The code for reclamation
> is inspired from Dave Hansen's challenged memory controller and from the
> shrink_all_memory() code


Hmm.. I seem to remember that all previous RSS rlimit attempts actually
fell flat on their face because of the reclaim-on-rss-overflow behavior;
in the shared page / cached page (equally important!) case, it means
process A (or container A) suddenly penalizes process B (or container B)
by making B have pagecache misses because A was using a low RSS limit.

Unmapping the page makes sense, sure, and even moving then to inactive
lists or whatever that is called in the vm today, but reclaim... that's
expensive...


2006-11-10 01:56:39

by Balbir Singh

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC][PATCH 8/8] RSS controller support reclamation

On 11/10/06, Arjan van de Ven <[email protected]> wrote:
> On Fri, 2006-11-10 at 01:06 +0530, Balbir Singh wrote:
> >
> > Reclaim memory as we hit the max_shares limit. The code for reclamation
> > is inspired from Dave Hansen's challenged memory controller and from the
> > shrink_all_memory() code
>
>
> Hmm.. I seem to remember that all previous RSS rlimit attempts actually
> fell flat on their face because of the reclaim-on-rss-overflow behavior;
> in the shared page / cached page (equally important!) case, it means
> process A (or container A) suddenly penalizes process B (or container B)
> by making B have pagecache misses because A was using a low RSS limit.
>
> Unmapping the page makes sense, sure, and even moving then to inactive
> lists or whatever that is called in the vm today, but reclaim... that's
> expensive...
>

I see your point, one of things we could do is that we could track
shared and cached pages separately and not be so severe on them.

I'll play around with this idea and see what I come up with.

Thanks for the feedback,
Balbir

2006-11-10 08:59:42

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/8] RSS controller support reclamation

Balbir Singh wrote:
> Reclaim memory as we hit the max_shares limit. The code for reclamation
> is inspired from Dave Hansen's challenged memory controller and from the
> shrink_all_memory() code
>
> Reclamation can be triggered from two paths
>
> 1. While incrementing the RSS, we hit the limit of the container
> 2. A container is resized, such that it's new limit is below its current
> RSS
>
> In (1) reclamation takes place in the background.

Hmm... This is not a hard limit in this case, right? And in case
of overloaded system from the moment reclamation thread is woken
up till the moment it starts shrinking zones container may touch
too many pages...

That's not good.

> TODO's
>
> 1. max_shares currently works like a soft limit. The RSS can grow beyond it's
> limit. One possible fix is to introduce a soft limit (reclaim when the
> container hits the soft limit) and fail when we hit the hard limit

Such soft limit doesn't help also. It just makes effects on
low-loaded system smoother.

And what about a hard limit - how would you fail in page fault in
case of limit hit? SIGKILL/SEGV is not an option - in this case we
should run synchronous reclamation. This is done in beancounter
patches v6 we've sent recently.

> Signed-off-by: Balbir Singh <[email protected]>
> ---
>
> --- linux-2.6.19-rc2/mm/vmscan.c~container-memctlr-reclaim 2006-11-09 22:21:11.000000000 +0530
> +++ linux-2.6.19-rc2-balbir/mm/vmscan.c 2006-11-09 22:21:11.000000000 +0530
> @@ -36,6 +36,8 @@
> #include <linux/rwsem.h>
> #include <linux/delay.h>
> #include <linux/kthread.h>
> +#include <linux/container.h>
> +#include <linux/memctlr.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -65,6 +67,9 @@ struct scan_control {
> int swappiness;
>
> int all_unreclaimable;
> +
> + int overlimit;
> + void *container; /* Added as void * to avoid #ifdef's */
> };
>
> /*
> @@ -811,6 +816,10 @@ force_reclaim_mapped:
> cond_resched();
> page = lru_to_page(&l_hold);
> list_del(&page->lru);
> + if (!memctlr_page_reclaim(page, sc->container, sc->overlimit)) {
> + list_add(&page->lru, &l_active);
> + continue;
> + }
> if (page_mapped(page)) {
> if (!reclaim_mapped ||
> (total_swap_pages == 0 && PageAnon(page)) ||

[snip] See comment below.

>
> +#ifdef CONFIG_RES_GROUPS_MEMORY
> +/*
> + * Modelled after shrink_all_memory
> + */
> +unsigned long memctlr_shrink_container_memory(unsigned long nr_pages,
> + struct container *container,
> + int overlimit)
> +{
> + unsigned long lru_pages;
> + unsigned long ret = 0;
> + int pass;
> + struct zone *zone;
> + struct scan_control sc = {
> + .gfp_mask = GFP_KERNEL,
> + .may_swap = 0,
> + .swap_cluster_max = nr_pages,
> + .may_writepage = 1,
> + .swappiness = vm_swappiness,
> + .overlimit = overlimit,
> + .container = container,
> + };
> +

[snip]

> + for (prio = DEF_PRIORITY; prio >= 0; prio--) {
> + unsigned long nr_to_scan = nr_pages - ret;
> +
> + sc.nr_scanned = 0;
> + ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
> + if (ret >= nr_pages)
> + break;
> +
> + if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
> + blk_congestion_wait(WRITE, HZ / 10);
> + }
> + }
> + return ret;
> +}
> +#endif

Please correct me if I'm wrong, but does this reclamation work like
"run over all the zones' lists searching for page whose controller
is sc->container" ?

[snip]

2006-11-10 09:11:43

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/8] RSS controller accounting

Balbir Singh wrote:
> Account RSS usage of a task and the associated container. The definition
> of RSS was debated and discussed in the following thread
>
> http://lkml.org/lkml/2006/10/10/130
>
>
> The code tracks all resident pages (including shared pages) as RSS. This patch
> can easily adapt to the definition of RSS that will be agreed upon. This
> implementation provides a proof of concept RSS controller.
>
> The accounting is inspired from Rohit Seth's container patches.
>
> TODO's
>
> 1. Merge file_rss and anon_rss tracking with the current rss tracking to
> maximize code reuse
> 2. Add/remove RSS tracking as the definition of RSS evolves
>
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
>

[snip]

> --- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
> +++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c 2006-11-09 21:47:06.000000000 +0530
> @@ -37,6 +37,8 @@ static struct resource_group *root_rgrou
> static const char version[] = "0.01";
> static struct memctlr *memctlr_root;
>
> +#define MEMCTLR_MAGIC 0xdededede
> +
> struct mem_counter {
> atomic_long_t rss;
> };
> @@ -49,6 +51,7 @@ struct memctlr {
> /* Statistics */
> int successes;
> int failures;
> + int magic;

What is this magic for? Is it just for debugging?

[snip]

> +static inline struct memctlr *get_memctlr_from_page(struct page *page)
> +{
> + struct resource_group *rgroup;
> + struct memctlr *res;
> +
> + /*
> + * Is the resource groups infrastructure initialized?
> + */
> + if (!memctlr_root)
> + return NULL;
> +
> + rcu_read_lock();
> + rgroup = (struct resource_group *)rcu_dereference(current->container);
> + rcu_read_unlock();
> +
> + res = get_memctlr(rgroup);
> + if (!res)
> + return NULL;
> +
> + BUG_ON(res->magic != MEMCTLR_MAGIC);
> + return res;
> +}

I don't see how page passed to this function is involved into
'struct memctlr *res' determining. Could you comment this?

[snip]

> --- linux-2.6.19-rc2/mm/rmap.c~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
> +++ linux-2.6.19-rc2-balbir/mm/rmap.c 2006-11-09 21:46:22.000000000 +0530
> @@ -537,6 +537,7 @@ void page_add_anon_rmap(struct page *pag
> if (atomic_inc_and_test(&page->_mapcount))
> __page_set_anon_rmap(page, vma, address);
> /* else checking page index and mapping is racy */
> + memctlr_inc_rss(page);
> }
>
> /*
> @@ -553,6 +554,7 @@ void page_add_new_anon_rmap(struct page
> {
> atomic_set(&page->_mapcount, 0); /* elevate count by 1 (starts at -1) */
> __page_set_anon_rmap(page, vma, address);
> + memctlr_inc_rss(page);
> }
>
> /**
> @@ -565,6 +567,7 @@ void page_add_file_rmap(struct page *pag
> {
> if (atomic_inc_and_test(&page->_mapcount))
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> + memctlr_inc_rss(page);

Consider a task maps one file page 100 times in different places
and touches 'all of them'. In this case I see that you'll get
100 in rss counter while real rss will be just 1.

> }
>
> /**
> @@ -596,8 +599,9 @@ void page_remove_rmap(struct page *page)
> if (page_test_and_clear_dirty(page))
> set_page_dirty(page);
> __dec_zone_page_state(page,
> - PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
> + PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);

What is this extra space after a question-mark for?

> }
> + memctlr_dec_rss(page, mm);
> }
>
> /*
> diff -puN include/linux/rmap.h~container-memctlr-acct include/linux/rmap.h
> --- linux-2.6.19-rc2/include/linux/rmap.h~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
> +++ linux-2.6.19-rc2-balbir/include/linux/rmap.h 2006-11-09 21:46:22.000000000 +0530
> @@ -8,6 +8,7 @@
> #include <linux/slab.h>
> #include <linux/mm.h>
> #include <linux/spinlock.h>
> +#include <linux/memctlr.h>
>
> /*
> * The anon_vma heads a list of private "related" vmas, to scan if
> @@ -84,6 +85,7 @@ void page_remove_rmap(struct page *);
> static inline void page_dup_rmap(struct page *page)
> {
> atomic_inc(&page->_mapcount);
> + memctlr_inc_rss(page);
> }

I'm not sure this is correct. page_dup_rmap() happens in the context
of forking process and thus you'll increment rss counter on current.
But this must be incremented at new task's counter, mustn't it?

2006-11-10 09:16:30

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][PATCH 6/8] RSS controller shares allocation

Balbir Singh wrote:
> Support shares assignment and propagation.
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
>
> kernel/res_group/memctlr.c | 59 ++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 58 insertions(+), 1 deletion(-)

[snip]

> +static void recalc_and_propagate(struct memctlr *res, struct memctlr *parres)
> +{
> + struct resource_group *child = NULL;
> + int child_divisor;
> + u64 numerator;
> + struct memctlr *child_res;
> +
> + if (parres) {
> + if (res->shares.max_shares == SHARE_DONT_CARE ||
> + parres->shares.max_shares == SHARE_DONT_CARE)
> + return;
> +
> + child_divisor = parres->shares.child_shares_divisor;
> + if (child_divisor == 0)
> + return;
> +
> + numerator = (u64)(parres->shares.unused_min_shares *
> + res->shares.max_shares);
> + do_div(numerator, child_divisor);
> + numerator = (u64)(parres->nr_pages * numerator);
> + do_div(numerator, SHARE_DEFAULT_DIVISOR);
> + res->nr_pages = numerator;
> + }
> +
> + for_each_child(child, res->rgroup) {
> + child_res = get_memctlr(child);
> + BUG_ON(!child_res);
> + recalc_and_propagate(child_res, res);

Recursion? Won't it eat all the stack in case of a deep tree?

> + }
> +
> +}
> +
> +static void memctlr_shares_changed(struct res_shares *shares)
> +{
> + struct memctlr *res, *parres;
> +
> + res = get_memctlr_from_shares(shares);
> + if (!res)
> + return;
> +
> + if (is_res_group_root(res->rgroup))
> + parres = NULL;
> + else
> + parres = get_memctlr((struct container *)res->rgroup->parent);
> +
> + recalc_and_propagate(res, parres);
> +}
> +
> struct res_controller memctlr_rg = {
> .name = res_ctlr_name,
> .ctlr_id = NO_RES_ID,
> .alloc_shares_struct = memctlr_alloc_instance,
> .free_shares_struct = memctlr_free_instance,
> .move_task = memctlr_move_task,
> - .shares_changed = NULL,
> + .shares_changed = memctlr_shares_changed,

I didn't find where in this patches this callback is called.

2006-11-10 09:16:54

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/8] RSS controller support reclamation

Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Reclaim memory as we hit the max_shares limit. The code for reclamation
>> is inspired from Dave Hansen's challenged memory controller and from the
>> shrink_all_memory() code
>>
>> Reclamation can be triggered from two paths
>>
>> 1. While incrementing the RSS, we hit the limit of the container
>> 2. A container is resized, such that it's new limit is below its current
>> RSS
>>
>> In (1) reclamation takes place in the background.
>
> Hmm... This is not a hard limit in this case, right? And in case
> of overloaded system from the moment reclamation thread is woken
> up till the moment it starts shrinking zones container may touch
> too many pages...
>
> That's not good.

Yes, please see my comments in the TODO's. Hard limits should be easy
to implement, it's a question of calling the correct routine based
on policy.

>
>> TODO's
>>
>> 1. max_shares currently works like a soft limit. The RSS can grow beyond it's
>> limit. One possible fix is to introduce a soft limit (reclaim when the
>> container hits the soft limit) and fail when we hit the hard limit
>
> Such soft limit doesn't help also. It just makes effects on
> low-loaded system smoother.
>
> And what about a hard limit - how would you fail in page fault in
> case of limit hit? SIGKILL/SEGV is not an option - in this case we
> should run synchronous reclamation. This is done in beancounter
> patches v6 we've sent recently.
>

I thought about running synchronous reclamation, but then did not follow
that approach, I was not sure if calling the reclaim routines from the
page fault context is a good thing to do. It's worth trying out, since
it would provide better control over rss.


>> Signed-off-by: Balbir Singh <[email protected]>
>> ---
>>
>> --- linux-2.6.19-rc2/mm/vmscan.c~container-memctlr-reclaim 2006-11-09 22:21:11.000000000 +0530
>> +++ linux-2.6.19-rc2-balbir/mm/vmscan.c 2006-11-09 22:21:11.000000000 +0530
>> @@ -36,6 +36,8 @@
>> #include <linux/rwsem.h>
>> #include <linux/delay.h>
>> #include <linux/kthread.h>
>> +#include <linux/container.h>
>> +#include <linux/memctlr.h>
>>
>> #include <asm/tlbflush.h>
>> #include <asm/div64.h>
>> @@ -65,6 +67,9 @@ struct scan_control {
>> int swappiness;
>>
>> int all_unreclaimable;
>> +
>> + int overlimit;
>> + void *container; /* Added as void * to avoid #ifdef's */
>> };
>>
>> /*
>> @@ -811,6 +816,10 @@ force_reclaim_mapped:
>> cond_resched();
>> page = lru_to_page(&l_hold);
>> list_del(&page->lru);
>> + if (!memctlr_page_reclaim(page, sc->container, sc->overlimit)) {
>> + list_add(&page->lru, &l_active);
>> + continue;
>> + }
>> if (page_mapped(page)) {
>> if (!reclaim_mapped ||
>> (total_swap_pages == 0 && PageAnon(page)) ||
>
> [snip] See comment below.
>
>>
>> +#ifdef CONFIG_RES_GROUPS_MEMORY
>> +/*
>> + * Modelled after shrink_all_memory
>> + */
>> +unsigned long memctlr_shrink_container_memory(unsigned long nr_pages,
>> + struct container *container,
>> + int overlimit)
>> +{
>> + unsigned long lru_pages;
>> + unsigned long ret = 0;
>> + int pass;
>> + struct zone *zone;
>> + struct scan_control sc = {
>> + .gfp_mask = GFP_KERNEL,
>> + .may_swap = 0,
>> + .swap_cluster_max = nr_pages,
>> + .may_writepage = 1,
>> + .swappiness = vm_swappiness,
>> + .overlimit = overlimit,
>> + .container = container,
>> + };
>> +
>
> [snip]
>
>> + for (prio = DEF_PRIORITY; prio >= 0; prio--) {
>> + unsigned long nr_to_scan = nr_pages - ret;
>> +
>> + sc.nr_scanned = 0;
>> + ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
>> + if (ret >= nr_pages)
>> + break;
>> +
>> + if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
>> + blk_congestion_wait(WRITE, HZ / 10);
>> + }
>> + }
>> + return ret;
>> +}
>> +#endif
>
> Please correct me if I'm wrong, but does this reclamation work like
> "run over all the zones' lists searching for page whose controller
> is sc->container" ?
>

Yeah, that's correct. The code can also reclaim memory from all over-the-limit
containers (by passing SC_OVERLIMIT_ALL). The idea behind using such a scheme
is to ensure that the global LRU list is not broken.


--
Thanks for the feedback,
Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-10 09:18:02

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/8] RSS controller fix resource groups parsing

Balbir Singh wrote:
> echo adds a "\n" to the end of a string. When this string is copied from
> user space, we need to remove it, so that match_token() can parse
> the user space string correctly
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
>
> kernel/res_group/rgcs.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff -puN kernel/res_group/rgcs.c~container-res-groups-fix-parsing kernel/res_group/rgcs.c
> --- linux-2.6.19-rc2/kernel/res_group/rgcs.c~container-res-groups-fix-parsing 2006-11-09 23:08:10.000000000 +0530
> +++ linux-2.6.19-rc2-balbir/kernel/res_group/rgcs.c 2006-11-09 23:08:10.000000000 +0530
> @@ -241,6 +241,12 @@ ssize_t res_group_file_write(struct cont
> }
> buf[nbytes] = 0; /* nul-terminate */
>
> + /*
> + * Ignore "\n". It might come in from echo(1)

Why not inform user he should call echo -n?

> + */
> + if (buf[nbytes - 1] == '\n')
> + buf[nbytes - 1] = 0;
> +
> container_manage_lock();
>
> if (container_is_removed(cont)) {
> _
>

That's the same patch as in [PATCH 1/8] mail. Did you attached
a wrong one?

2006-11-10 09:32:46

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/8] RSS controller fix resource groups parsing

Pavel Emelianov wrote:
> Balbir Singh wrote:
>> echo adds a "\n" to the end of a string. When this string is copied from
>> user space, we need to remove it, so that match_token() can parse
>> the user space string correctly
>>
>> Signed-off-by: Balbir Singh <[email protected]>
>> ---
>>
>> kernel/res_group/rgcs.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff -puN kernel/res_group/rgcs.c~container-res-groups-fix-parsing kernel/res_group/rgcs.c
>> --- linux-2.6.19-rc2/kernel/res_group/rgcs.c~container-res-groups-fix-parsing 2006-11-09 23:08:10.000000000 +0530
>> +++ linux-2.6.19-rc2-balbir/kernel/res_group/rgcs.c 2006-11-09 23:08:10.000000000 +0530
>> @@ -241,6 +241,12 @@ ssize_t res_group_file_write(struct cont
>> }
>> buf[nbytes] = 0; /* nul-terminate */
>>
>> + /*
>> + * Ignore "\n". It might come in from echo(1)
>
> Why not inform user he should call echo -n?

Yes, but what if the user does not use it? We can't afford to do the
wrong thing. But it's a good point, I'll document and recommend that
the users use echo -n.


>
>> + */
>> + if (buf[nbytes - 1] == '\n')
>> + buf[nbytes - 1] = 0;
>> +
>> container_manage_lock();
>>
>> if (container_is_removed(cont)) {
>> _
>>
>
> That's the same patch as in [PATCH 1/8] mail. Did you attached
> a wrong one?

Yeah... I moved this patch from #7 to #1 and did not remove it.
Sorry!

--
Thanks,
Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-10 09:33:37

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/8] RSS controller support reclamation

Balbir Singh wrote:

[snip]

>> And what about a hard limit - how would you fail in page fault in
>> case of limit hit? SIGKILL/SEGV is not an option - in this case we
>> should run synchronous reclamation. This is done in beancounter
>> patches v6 we've sent recently.
>>
>
> I thought about running synchronous reclamation, but then did not follow
> that approach, I was not sure if calling the reclaim routines from the
> page fault context is a good thing to do. It's worth trying out, since

Each page fault potentially calls reclamation by allocating
required page with __GFP_IO | __GFP_FS bits set. Synchronous
reclamation in page fault is really normal.

[snip]

>> Please correct me if I'm wrong, but does this reclamation work like
>> "run over all the zones' lists searching for page whose controller
>> is sc->container" ?
>>
>
> Yeah, that's correct. The code can also reclaim memory from all over-the-limit

OK. What if I have a container with 100 pages limit in a 4Gb
(~ million of pages) machine and this group starts reclaiming
its pages. In case this group uses its pages heavily they will
be at the beginning of an LRU list and reclamation code would
have to scan through all (million) pages before it finds proper
ones. This is not optimal!

> containers (by passing SC_OVERLIMIT_ALL). The idea behind using such a scheme
> is to ensure that the global LRU list is not broken.

isolate_lru_pages() helps in this. As far as I remember this
was introduced to reduce lru lock contention and keep lru
lists integrity.

In beancounters patches this is used to shrink BC's pages.

2006-11-10 09:52:58

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/8] RSS controller accounting

Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Account RSS usage of a task and the associated container. The definition
>> of RSS was debated and discussed in the following thread
>>
>> http://lkml.org/lkml/2006/10/10/130
>>
>>
>> The code tracks all resident pages (including shared pages) as RSS. This patch
>> can easily adapt to the definition of RSS that will be agreed upon. This
>> implementation provides a proof of concept RSS controller.
>>
>> The accounting is inspired from Rohit Seth's container patches.
>>
>> TODO's
>>
>> 1. Merge file_rss and anon_rss tracking with the current rss tracking to
>> maximize code reuse
>> 2. Add/remove RSS tracking as the definition of RSS evolves
>>
>>
>> Signed-off-by: Balbir Singh <[email protected]>
>> ---
>>
>
> [snip]
>
>> --- linux-2.6.19-rc2/kernel/res_group/memctlr.c~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
>> +++ linux-2.6.19-rc2-balbir/kernel/res_group/memctlr.c 2006-11-09 21:47:06.000000000 +0530
>> @@ -37,6 +37,8 @@ static struct resource_group *root_rgrou
>> static const char version[] = "0.01";
>> static struct memctlr *memctlr_root;
>>
>> +#define MEMCTLR_MAGIC 0xdededede
>> +
>> struct mem_counter {
>> atomic_long_t rss;
>> };
>> @@ -49,6 +51,7 @@ struct memctlr {
>> /* Statistics */
>> int successes;
>> int failures;
>> + int magic;
>
> What is this magic for? Is it just for debugging?
>

Yes

> [snip]
>
>> +static inline struct memctlr *get_memctlr_from_page(struct page *page)
>> +{
>> + struct resource_group *rgroup;
>> + struct memctlr *res;
>> +
>> + /*
>> + * Is the resource groups infrastructure initialized?
>> + */
>> + if (!memctlr_root)
>> + return NULL;
>> +
>> + rcu_read_lock();
>> + rgroup = (struct resource_group *)rcu_dereference(current->container);
>> + rcu_read_unlock();
>> +
>> + res = get_memctlr(rgroup);
>> + if (!res)
>> + return NULL;
>> +
>> + BUG_ON(res->magic != MEMCTLR_MAGIC);
>> + return res;
>> +}
>
> I don't see how page passed to this function is involved into
> 'struct memctlr *res' determining. Could you comment this?
>

Yeah, from page is a misnomer. We just use the current task task.
I'll fix the naming convention


> [snip]
>
>> --- linux-2.6.19-rc2/mm/rmap.c~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
>> +++ linux-2.6.19-rc2-balbir/mm/rmap.c 2006-11-09 21:46:22.000000000 +0530
>> @@ -537,6 +537,7 @@ void page_add_anon_rmap(struct page *pag
>> if (atomic_inc_and_test(&page->_mapcount))
>> __page_set_anon_rmap(page, vma, address);
>> /* else checking page index and mapping is racy */
>> + memctlr_inc_rss(page);
>> }
>>
>> /*
>> @@ -553,6 +554,7 @@ void page_add_new_anon_rmap(struct page
>> {
>> atomic_set(&page->_mapcount, 0); /* elevate count by 1 (starts at -1) */
>> __page_set_anon_rmap(page, vma, address);
>> + memctlr_inc_rss(page);
>> }
>>
>> /**
>> @@ -565,6 +567,7 @@ void page_add_file_rmap(struct page *pag
>> {
>> if (atomic_inc_and_test(&page->_mapcount))
>> __inc_zone_page_state(page, NR_FILE_MAPPED);
>> + memctlr_inc_rss(page);
>
> Consider a task maps one file page 100 times in different places
> and touches 'all of them'. In this case I see that you'll get
> 100 in rss counter while real rss will be just 1.
>

Hmmm... something for me to think about. Depending on how we define
RSS, the code for accounting should be easy to add & modify depending
on how we define RSS. But you bring up a very good point.

>> }
>>
>> /**
>> @@ -596,8 +599,9 @@ void page_remove_rmap(struct page *page)
>> if (page_test_and_clear_dirty(page))
>> set_page_dirty(page);
>> __dec_zone_page_state(page,
>> - PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
>> + PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
>
> What is this extra space after a question-mark for?

This is again something I changed and looks my undo was not very good.
Please ignore it, I'll remove it from the diff.

>
>> }
>> + memctlr_dec_rss(page, mm);
>> }
>>
>> /*
>> diff -puN include/linux/rmap.h~container-memctlr-acct include/linux/rmap.h
>> --- linux-2.6.19-rc2/include/linux/rmap.h~container-memctlr-acct 2006-11-09 21:46:22.000000000 +0530
>> +++ linux-2.6.19-rc2-balbir/include/linux/rmap.h 2006-11-09 21:46:22.000000000 +0530
>> @@ -8,6 +8,7 @@
>> #include <linux/slab.h>
>> #include <linux/mm.h>
>> #include <linux/spinlock.h>
>> +#include <linux/memctlr.h>
>>
>> /*
>> * The anon_vma heads a list of private "related" vmas, to scan if
>> @@ -84,6 +85,7 @@ void page_remove_rmap(struct page *);
>> static inline void page_dup_rmap(struct page *page)
>> {
>> atomic_inc(&page->_mapcount);
>> + memctlr_inc_rss(page);
>> }
>
> I'm not sure this is correct. page_dup_rmap() happens in the context
> of forking process and thus you'll increment rss counter on current.
> But this must be incremented at new task's counter, mustn't it?

This is fixed in the next patch container-memctlr-task-migration.
Thanks for spotting it.

--
Thanks,
Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-10 10:28:29

by Balbir Singh

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC][PATCH 6/8] RSS controller shares allocation

Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Support shares assignment and propagation.
>>
>> Signed-off-by: Balbir Singh <[email protected]>
>> ---
>>
>> kernel/res_group/memctlr.c | 59 ++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 58 insertions(+), 1 deletion(-)
>
> [snip]
>
>> +static void recalc_and_propagate(struct memctlr *res, struct memctlr *parres)
>> +{
>> + struct resource_group *child = NULL;
>> + int child_divisor;
>> + u64 numerator;
>> + struct memctlr *child_res;
>> +
>> + if (parres) {
>> + if (res->shares.max_shares == SHARE_DONT_CARE ||
>> + parres->shares.max_shares == SHARE_DONT_CARE)
>> + return;
>> +
>> + child_divisor = parres->shares.child_shares_divisor;
>> + if (child_divisor == 0)
>> + return;
>> +
>> + numerator = (u64)(parres->shares.unused_min_shares *
>> + res->shares.max_shares);
>> + do_div(numerator, child_divisor);
>> + numerator = (u64)(parres->nr_pages * numerator);
>> + do_div(numerator, SHARE_DEFAULT_DIVISOR);
>> + res->nr_pages = numerator;
>> + }
>> +
>> + for_each_child(child, res->rgroup) {
>> + child_res = get_memctlr(child);
>> + BUG_ON(!child_res);
>> + recalc_and_propagate(child_res, res);
>
> Recursion? Won't it eat all the stack in case of a deep tree?

The depth of the hierarchy can be controlled. Recursion is needed
to do a DFS walk

>
>> + }
>> +
>> +}
>> +
>> +static void memctlr_shares_changed(struct res_shares *shares)
>> +{
>> + struct memctlr *res, *parres;
>> +
>> + res = get_memctlr_from_shares(shares);
>> + if (!res)
>> + return;
>> +
>> + if (is_res_group_root(res->rgroup))
>> + parres = NULL;
>> + else
>> + parres = get_memctlr((struct container *)res->rgroup->parent);
>> +
>> + recalc_and_propagate(res, parres);
>> +}
>> +
>> struct res_controller memctlr_rg = {
>> .name = res_ctlr_name,
>> .ctlr_id = NO_RES_ID,
>> .alloc_shares_struct = memctlr_alloc_instance,
>> .free_shares_struct = memctlr_free_instance,
>> .move_task = memctlr_move_task,
>> - .shares_changed = NULL,
>> + .shares_changed = memctlr_shares_changed,
>
> I didn't find where in this patches this callback is called.

It's a part of the resource groups infrastructure. It's been ported
on top of Paul Menage's containers patches. The code can be easily
adapted to work directly with containers instead of resource groups
if required.



--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-10 10:36:35

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC][PATCH 6/8] RSS controller shares allocation

[snip]

>>> + for_each_child(child, res->rgroup) {
>>> + child_res = get_memctlr(child);
>>> + BUG_ON(!child_res);
>>> + recalc_and_propagate(child_res, res);
>> Recursion? Won't it eat all the stack in case of a deep tree?
>
> The depth of the hierarchy can be controlled. Recursion is needed
> to do a DFS walk

That's another point against recursion - bad root can
crash the kernel... If we are about to give container's
users ability to make their own subtrees then we *must*
avoid recursion. There's an algorithm that allows one
to walk the tree like this w/o recursion.

[snip]

>> I didn't find where in this patches this callback is called.
>
> It's a part of the resource groups infrastructure. It's been ported
> on top of Paul Menage's containers patches. The code can be easily
> adapted to work directly with containers instead of resource groups
> if required.


Could you please give me a link to the patch where this
is called?

2006-11-10 12:43:27

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 8/8] RSS controller support reclamation

Pavel Emelianov wrote:
> Balbir Singh wrote:
>
> [snip]
>
>>> And what about a hard limit - how would you fail in page fault in
>>> case of limit hit? SIGKILL/SEGV is not an option - in this case we
>>> should run synchronous reclamation. This is done in beancounter
>>> patches v6 we've sent recently.
>>>
>> I thought about running synchronous reclamation, but then did not follow
>> that approach, I was not sure if calling the reclaim routines from the
>> page fault context is a good thing to do. It's worth trying out, since
>
> Each page fault potentially calls reclamation by allocating
> required page with __GFP_IO | __GFP_FS bits set. Synchronous
> reclamation in page fault is really normal.

True. I don't know what I was thinking, thanks for making me think
straight.

>
> [snip]
>
>>> Please correct me if I'm wrong, but does this reclamation work like
>>> "run over all the zones' lists searching for page whose controller
>>> is sc->container" ?
>>>
>> Yeah, that's correct. The code can also reclaim memory from all over-the-limit
>
> OK. What if I have a container with 100 pages limit in a 4Gb
> (~ million of pages) machine and this group starts reclaiming
> its pages. In case this group uses its pages heavily they will
> be at the beginning of an LRU list and reclamation code would
> have to scan through all (million) pages before it finds proper
> ones. This is not optimal!
>

Yes, thats possible. The trade off is between

The cost associated with traversing that list while reclaiming
and the complexity associated with task migration. If we keep
a per-container list of pages, during task migration, you'll have
to migrate pages (of the task) from the list to the new container.

>> containers (by passing SC_OVERLIMIT_ALL). The idea behind using such a scheme
>> is to ensure that the global LRU list is not broken.
>
> isolate_lru_pages() helps in this. As far as I remember this
> was introduced to reduce lru lock contention and keep lru
> lists integrity.
>
> In beancounters patches this is used to shrink BC's pages.

I'll look at isolate_lru_pages() to see if the reclaim can be optimized.

Thanks for your feedback,


--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-10 12:56:11

by Balbir Singh

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC][PATCH 6/8] RSS controller shares allocation

Pavel Emelianov wrote:
> [snip]
>
>>>> + for_each_child(child, res->rgroup) {
>>>> + child_res = get_memctlr(child);
>>>> + BUG_ON(!child_res);
>>>> + recalc_and_propagate(child_res, res);
>>> Recursion? Won't it eat all the stack in case of a deep tree?
>> The depth of the hierarchy can be controlled. Recursion is needed
>> to do a DFS walk
>
> That's another point against recursion - bad root can
> crash the kernel... If we are about to give container's
> users ability to make their own subtrees then we *must*
> avoid recursion. There's an algorithm that allows one
> to walk the tree like this w/o recursion.

Bad pointers are always bad, whether they are the root or
any other pointer. Tree traversal is a generic infrastructure issue
for any infrastructure that supports a hierarchy.

Are you talking about threaded trees? Yes, they can be traversed
without recursion. I need to recheck my DS reference to double
check.

>
> [snip]
>
>>> I didn't find where in this patches this callback is called.
>> It's a part of the resource groups infrastructure. It's been ported
>> on top of Paul Menage's containers patches. The code can be easily
>> adapted to work directly with containers instead of resource groups
>> if required.
>
>
> Could you please give me a link to the patch where this
> is called?

Please see

http://www.mail-archive.com/[email protected]/msg03333.html

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-15 11:59:46

by Patrick.Le-Dot

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/8] RSS controller task migration support

Hi Balbir,

The get_task_mm()/mmput(mm) usage is not correct.
With CONFIG_DEBUG_SPINLOCK_SLEEP=y :

BUG: sleeping function called from invalid context at kernel/fork.c:390
in_atomic():1, irqs_disabled():0
[<c0116620>] __might_sleep+0x97/0x9c
[<c0116a2e>] mmput+0x15/0x8b
[<c01582f6>] install_arg_page+0x72/0xa9
[<c01584b1>] setup_arg_pages+0x184/0x1a5
...

BUG: sleeping function called from invalid context at kernel/fork.c:390
in_atomic():1, irqs_disabled():0
[<c0116620>] __might_sleep+0x97/0x9c
[<c0116a2e>] mmput+0x15/0x8b
[<c01468ee>] do_no_page+0x255/0x2bd
[<c0146b8d>] __handle_mm_fault+0xed/0x1ef
[<c0111884>] do_page_fault+0x247/0x506
[<c011163d>] do_page_fault+0x0/0x506
[<c0348f99>] error_code+0x39/0x40


current->mm seems to be enough here.



In patch4, memctlr_dec_rss(page, mm) should be memctlr_dec_rss(page)
to compile correctly.

and in patch0 :
> 4. Disable cpuset's (to simply assignment of tasks to resource groups)
> cd /container
> echo 0 > cpuset_enabled

should be :
echo 0 > cpuacct_enabled

Note : cpuacct_enabled is 0 by default.


Now the big question : to implement guarantee, the LRU needs to know
if a page can be removed from memory or not.
Any ideas to do that without any change in the struct page ?

Patrick

+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Patrick Le Dot
mailto: [email protected]@bull.net Centre UNIX de BULL SAS
Phone : +33 4 76 29 73 20 1, Rue de Provence BP 208
Fax : +33 4 76 29 76 00 38130 ECHIROLLES Cedex FRANCE
Bull, Architect of an Open World TM
http://www.bull.com

2006-11-15 16:37:53

by Balbir Singh

[permalink] [raw]
Subject: Re: [ckrm-tech] [RFC][PATCH 5/8] RSS controller task migration support

Patrick.Le-Dot wrote:
> Hi Balbir,
>
> The get_task_mm()/mmput(mm) usage is not correct.
> With CONFIG_DEBUG_SPINLOCK_SLEEP=y :
>
> BUG: sleeping function called from invalid context at kernel/fork.c:390
> in_atomic():1, irqs_disabled():0
> [<c0116620>] __might_sleep+0x97/0x9c
> [<c0116a2e>] mmput+0x15/0x8b
> [<c01582f6>] install_arg_page+0x72/0xa9
> [<c01584b1>] setup_arg_pages+0x184/0x1a5
> ...
>
> BUG: sleeping function called from invalid context at kernel/fork.c:390
> in_atomic():1, irqs_disabled():0
> [<c0116620>] __might_sleep+0x97/0x9c
> [<c0116a2e>] mmput+0x15/0x8b
> [<c01468ee>] do_no_page+0x255/0x2bd
> [<c0146b8d>] __handle_mm_fault+0xed/0x1ef
> [<c0111884>] do_page_fault+0x247/0x506
> [<c011163d>] do_page_fault+0x0/0x506
> [<c0348f99>] error_code+0x39/0x40
>
>
> current->mm seems to be enough here.

Excellent, thanks for catching this!

>
>
>
> In patch4, memctlr_dec_rss(page, mm) should be memctlr_dec_rss(page)
> to compile correctly.
>
> and in patch0 :
>> 4. Disable cpuset's (to simply assignment of tasks to resource groups)
>> cd /container
>> echo 0 > cpuset_enabled
>
> should be :
> echo 0 > cpuacct_enabled
>
> Note : cpuacct_enabled is 0 by default.
>

Thanks for pointing this out.

>
> Now the big question : to implement guarantee, the LRU needs to know
> if a page can be removed from memory or not.
> Any ideas to do that without any change in the struct page ?
>

For implementing guarantees, we can use limits. Please see
http://wiki.openvz.org/Containers/Guarantees_for_resources.


Thanks for the feedback!

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-11-21 10:01:56

by Patrick.Le-Dot

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/8] RSS controller task migration support

On Fri, 17 Nov 2006 22:04:08 +0530
> ...
> I am not against guarantees, but
>
> Consider the following scenario, let's say we implement guarantees
>
> 1. If we account for kernel resources, how do you provide guarantees
> when you have non-reclaimable resources?

First, the current patch is based only on pages available in the
struct mm.
I doubt that these pages are "non-reclaimable"...

And guarantee should be ignored just because some kernel resources
are marked "non-reclaimable" ?


> 2. If a customer runs a system with swap turned off (which is quite
> common),

quite common, really ?

> then anonymous memory becomes irreclaimable. If a group
> takes more than it's fair share (exceeds its guarantee), you
> have scenario similar to 1 above.

That seems to be just a subset of the "guarantee+limit" model : if
guarantee is not useful for you, don't use it.

I'm not saying that guarantee should be a magic piece of code working
for everybody.

But we have to propose something for the customers who ask for a
guarantee (ie using a system with swap turned on like me and this is
quite common:-)

Patrick

+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Patrick Le Dot
mailto: [email protected]@bull.net Centre UNIX de BULL SAS
Phone : +33 4 76 29 73 20 1, Rue de Provence BP 208
Fax : +33 4 76 29 76 00 38130 ECHIROLLES Cedex FRANCE
Bull, Architect of an Open World TM
http://www.bull.com

2006-11-21 11:07:22

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/8] RSS controller task migration support

Patrick.Le-Dot wrote:
> On Fri, 17 Nov 2006 22:04:08 +0530
>> ...
>> I am not against guarantees, but
>>
>> Consider the following scenario, let's say we implement guarantees
>>
>> 1. If we account for kernel resources, how do you provide guarantees
>> when you have non-reclaimable resources?
>
> First, the current patch is based only on pages available in the
> struct mm.
> I doubt that these pages are "non-reclaimable"...

I am speaking of a scenario when we start supporting kernel accounting
and of-course the swapless case.

>
> And guarantee should be ignored just because some kernel resources
> are marked "non-reclaimable" ?
>

Ok.. but can you have a consistent guarantee definition with un-reclaimable
kernel resources? How do you define a guarantee in a consistent manner?
In my discussions earlier on lkml, I had suggested that we define guarantee
only for reclaimable resources and provide support only for them.

>
>> 2. If a customer runs a system with swap turned off (which is quite
>> common),
>
> quite common, really ?

Yep, I was listening to a talk from a customer service expert and he
mentioned that it's used to boost performance.

>
>> then anonymous memory becomes irreclaimable. If a group
>> takes more than it's fair share (exceeds its guarantee), you
>> have scenario similar to 1 above.
>
> That seems to be just a subset of the "guarantee+limit" model : if
> guarantee is not useful for you, don't use it.
>
> I'm not saying that guarantee should be a magic piece of code working
> for everybody.
>
> But we have to propose something for the customers who ask for a
> guarantee (ie using a system with swap turned on like me and this is
> quite common:-)
>

Like I said I am not against guarantees, but do we have to implement
them in our first iteration?


> Patrick
>


--

Balbir Singh,
Linux Technology Center,
IBM Software Labs